Optimizing FTS Indexes, B-Tree Depth, and EAV Model Queries in SQLite


Issue Overview: Text Deduplication, Tokenization Tradeoffs, and EAV Query Optimization

This guide addresses three interconnected challenges when implementing NoSQL-like features in SQLite:

  1. Efficient text deduplication in Full-Text Search (FTS) tables while avoiding redundant storage of large strings.
  2. Balancing B-tree depth and tokenization efficiency for path-like attributes stored as text.
  3. Optimizing query performance for Entity-Attribute-Value (EAV) models requiring multi-attribute filtering via self-joins.

Each problem intersects with SQLite’s indexing mechanics, storage efficiency, and query planner behavior. We dissect the technical constraints and provide actionable solutions.


Possible Causes: FTS Index Limitations, B-Tree Structure, and Query Planner Behavior

1. FTS Virtual Table Constraints and Hash Index Alternatives

SQLite’s FTS virtual tables inherently lack support for secondary indexes. Attempting to create an index on alltexts(SipHash) fails because:

  • FTS tables are designed for inverted text indexing, not general-purpose data storage.
  • The SipHash column in an FTS table is stored as plain text (due to FTS type coercion), negating hash-based lookups.
  • Unique constraints via UNIQUE or PRIMARY KEY are unsupported in FTS tables, forcing duplication checks to rely on external logic.

The root cause is conflating FTS’s text-search capabilities with relational data integrity features. A separate deduplication mechanism is required.

2. B-Tree Depth vs. Tokenization Efficiency in Path Storage

Storing paths as human-readable strings (message.media.photo) versus encoded tokens (aa.bb.cc) involves tradeoffs:

  • Readability vs. Compactness: Longer strings increase B-tree node size, reducing fan-out and increasing tree depth. However, encoded tokens complicate debugging and query construction.
  • Tokenizer Interactions: FTS tokenizers split paths into lexemes. Shorter tokens (e.g., aa) may reduce lexeme variety, but longer descriptive strings improve query intuitiveness.
  • Merge Command Impact: FTS’s merge command consolidates inverted index B-trees. Larger B-trees post-merge might reduce fragmentation but don’t inherently degrade search speed if disk seeks are minimized.

The tension arises from conflicting goals: minimizing storage overhead while maximizing query flexibility.

3. EAV Model Query Performance and Index Utilization

EAV models require intersecting results from multiple attribute filters, often implemented via self-joins:

SELECT l1.object_id
FROM leaf_values l1
JOIN leaf_values l2 ON l1.object_id = l2.object_id
WHERE l1.attribute = 'attrib1' AND l1.value = 'value1'
  AND l2.attribute = 'attrib2' AND l2.value = 'value2';

Challenges include:

  • Single Index Per Table Limitation: SQLite’s query planner uses at most one index per table in a join. Without compound indexes spanning (attribute, value, object_id), self-joins may resort to full scans.
  • Skip-Scan Optimization Dependencies: Efficient use of compound indexes for partial matches (e.g., col_a = ?) requires ANALYZE statistics. CHECK constraints alone don’t provide cardinality hints.
  • No Native AND-by-INTERSECT: SQLite doesn’t combine multiple single-column indexes for AND conditions, unlike OR clauses optimized via UNION.

These limitations force developers to choose between wide compound indexes (with high write overhead) or dynamic query rewriting.


Troubleshooting Steps, Solutions & Fixes

1. Implementing Hash-Based Text Deduplication Without FTS Indexes

Step 1: Separate Text Storage from FTS

Store raw texts in a dedicated table with hash-based deduplication:

CREATE TABLE text_store (
  id INTEGER PRIMARY KEY,
  sip_hash BLOB UNIQUE,  -- Collision-resistant hash (e.g., SipHash-2-4)
  content TEXT COMPRESSED  -- Use SQLite’s BLOB compression
);

CREATE VIRTUAL TABLE text_fts USING fts4(content);
  • Insertion Workflow:
    1. Compute sip_hash of the input text.
    2. Attempt to insert (sip_hash, content) into text_store, relying on UNIQUE to reject duplicates.
    3. On conflict, retrieve the existing id.
    4. Insert the content into text_fts with the same rowid as text_store.id.

Step 2: Leverage SQLite’s Compression Extensions

Use zlib or zstd extensions to compress text_store.content:

-- Compress on insert
INSERT INTO text_store (sip_hash, content) 
VALUES (siphash_128(?), sqlite_compress(?));

-- Decompress on select
SELECT sqlite_decompress(content) FROM text_store WHERE id = ?;

This reduces storage for long texts while maintaining FTS performance, as text_fts contains uncompressed tokens.

Step 3: Hash Collision Handling

Although SipHash-1-3 has a 64-bit output, collisions are improbable for small datasets. For petabyte-scale archives:

  • Use a 128-bit hash (e.g., SipHash-2-4 with a 128-bit key).
  • Add a collision check during insertion:
INSERT INTO text_store (sip_hash, content)
SELECT ? AS sip_hash, ? AS content
WHERE NOT EXISTS (
  SELECT 1 FROM text_store 
  WHERE sip_hash = ? AND sqlite_decompress(content) = ?
);

2. Optimizing Path Storage for B-Tree and FTS Efficiency

Step 1: Hybrid Encoding for Path Components

Encode path components as integers stored in a lookup table, but retain human-readable aliases for queries:

CREATE TABLE path_components (
  id INTEGER PRIMARY KEY,
  name TEXT UNIQUE  -- e.g., "message", "media"
);

-- Example path: message.media.photo → 1.5.12
CREATE TABLE object_paths (
  object_id INTEGER,
  path_ids TEXT,  -- Dot-separated component IDs
  path_text TEXT,  -- Generated column for debugging
  FOREIGN KEY (object_id) REFERENCES objects(id)
);

CREATE VIRTUAL TABLE path_fts USING fts4(path_text);
  • Tradeoff: path_ids minimizes storage and B-tree depth, while path_text (generated via JOIN with path_components) aids in debugging.

Step 2: Tokenizer Configuration for Encoded Paths

Use a custom tokenizer to split path_ids into numeric components:

-- Register a tokenizer that splits on '.' and treats IDs as tokens
CREATE VIRTUAL TABLE path_fts USING fts4(
  path_ids, 
  tokenize=unicode61 "separators=. tokenchars=0123456789"
);

This allows FTS queries like:

SELECT object_id FROM path_fts WHERE path_ids MATCH '1 5';  -- Components 1 AND 5

Step 3: B-Tree Depth Mitigation

  • Shorter Keys: Using INTEGER identifiers instead of TEXT reduces index key size. A 4-byte integer can represent 4 billion unique components.
  • Prefix Compression: For path_ids stored as TEXT, enable SQLITE_ENABLE_ICU and use a collation that sorts numeric components lexically (e.g., CAST(id AS TEXT)).

Benchmark with EXPLAIN QUERY PLAN to compare index usage:

EXPLAIN QUERY PLAN
SELECT object_id FROM object_paths WHERE path_ids LIKE '1.5.%';

3. Optimizing EAV Queries with Self-Joins and Compound Indexes

Step 1: Index Strategy for Leaf Values

Create a covering index for attribute-value lookups:

CREATE INDEX leaf_attr_val ON leaf_values(attribute, value, object_id);

This allows index-only scans for queries filtering on attribute and value.

Step 2: Query Rewriting for Intersection

Instead of self-joins, use INTERSECT for AND conditions:

SELECT object_id FROM leaf_values WHERE attribute = 'a' AND value = '1'
INTERSECT
SELECT object_id FROM leaf_values WHERE attribute = 'b' AND value = '2';
  • Advantage: Each subquery uses the leaf_attr_val index, and INTERSECT eliminates duplicates in memory.
  • Disadvantage: Manual query construction; consider dynamic SQL generation.

Step 3: Materialized Attribute Views

For frequently filtered attributes, create materialized views:

CREATE TABLE object_attr_a (
  object_id INTEGER PRIMARY KEY,
  value INTEGER,
  FOREIGN KEY (object_id) REFERENCES objects(id)
);

CREATE INDEX attr_a_value ON object_attr_a(value);
  • Tradeoff: Increases write overhead but enables single-table queries for common attributes.

Step 4: Leverage ANALYZE for Skip-Scan

Run ANALYZE to enable skip-scan optimization:

ANALYZE;

Verify statistics with:

SELECT * FROM sqlite_stat1 WHERE tbl = 'leaf_values';

Ensure high nDistinct values for attribute to favor skip-scans.

Step 5: Partial Indexes for Sparse Attributes

For rarely used attributes, use partial indexes:

CREATE INDEX leaf_attr_c ON leaf_values(object_id)
WHERE attribute = 'c' AND value > 100;

Reduces index size and write latency for infrequently queried attributes.


By addressing these areas with a combination of schema redesign, query rewriting, and strategic indexing, developers can implement efficient NoSQL-like features in SQLite while respecting its inherent constraints.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *