Optimizing FTS Indexes, B-Tree Depth, and EAV Model Queries in SQLite
Issue Overview: Text Deduplication, Tokenization Tradeoffs, and EAV Query Optimization
This guide addresses three interconnected challenges when implementing NoSQL-like features in SQLite:
- Efficient text deduplication in Full-Text Search (FTS) tables while avoiding redundant storage of large strings.
- Balancing B-tree depth and tokenization efficiency for path-like attributes stored as text.
- Optimizing query performance for Entity-Attribute-Value (EAV) models requiring multi-attribute filtering via self-joins.
Each problem intersects with SQLite’s indexing mechanics, storage efficiency, and query planner behavior. We dissect the technical constraints and provide actionable solutions.
Possible Causes: FTS Index Limitations, B-Tree Structure, and Query Planner Behavior
1. FTS Virtual Table Constraints and Hash Index Alternatives
SQLite’s FTS virtual tables inherently lack support for secondary indexes. Attempting to create an index on alltexts(SipHash) fails because:
- FTS tables are designed for inverted text indexing, not general-purpose data storage.
- The SipHash column in an FTS table is stored as plain text (due to FTS type coercion), negating hash-based lookups.
- Unique constraints via UNIQUE or PRIMARY KEY are unsupported in FTS tables, forcing duplication checks to rely on external logic.
The root cause is conflating FTS’s text-search capabilities with relational data integrity features. A separate deduplication mechanism is required.
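The failure is easy to confirm directly. A minimal sketch using Python's sqlite3 module (assuming FTS4 is compiled into the linked SQLite, as it is in most builds):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# In an FTS table, SipHash is just another text column of the inverted index.
conn.execute("CREATE VIRTUAL TABLE alltexts USING fts4(content, SipHash)")

try:
    conn.execute("CREATE INDEX alltexts_hash ON alltexts(SipHash)")
except sqlite3.OperationalError as e:
    print("rejected:", e)  # SQLite refuses to index any virtual table
```

The same error is raised for every virtual table module, not just FTS; the restriction is structural.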
2. B-Tree Depth vs. Tokenization Efficiency in Path Storage
Storing paths as human-readable strings (message.media.photo) versus encoded tokens (aa.bb.cc) involves tradeoffs:
- Readability vs. Compactness: Longer strings increase B-tree node size, reducing fan-out and increasing tree depth. However, encoded tokens complicate debugging and query construction.
- Tokenizer Interactions: FTS tokenizers split paths into lexemes. Shorter tokens (e.g., aa) may reduce lexeme variety, but longer descriptive strings improve query intuitiveness.
- Merge Command Impact: FTS's merge command consolidates inverted-index B-trees. Larger B-trees post-merge might reduce fragmentation but don't inherently degrade search speed if disk seeks are minimized.
The tension arises from conflicting goals: minimizing storage overhead while maximizing query flexibility.
3. EAV Model Query Performance and Index Utilization
EAV models require intersecting results from multiple attribute filters, often implemented via self-joins:
SELECT l1.object_id
FROM leaf_values l1
JOIN leaf_values l2 ON l1.object_id = l2.object_id
WHERE l1.attribute = 'attrib1' AND l1.value = 'value1'
AND l2.attribute = 'attrib2' AND l2.value = 'value2';
Challenges include:
- Single Index Per Table Limitation: SQLite's query planner uses at most one index per table instance in a join. Without a compound index spanning (attribute, value, object_id), self-joins may resort to full scans.
- Skip-Scan Optimization Dependencies: Efficient use of a compound index when the leading column is not constrained (e.g., filtering only on value) requires ANALYZE statistics. CHECK constraints alone don't provide cardinality hints.
- No Native AND-by-INTERSECT: SQLite doesn't combine multiple single-column indexes for AND conditions, unlike OR clauses, which can be optimized via UNION-style plans.
These limitations force developers to choose between wide compound indexes (with high write overhead) or dynamic query rewriting.
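The planner's choice is visible with EXPLAIN QUERY PLAN. A minimal sketch (using the leaf_values schema and the compound index discussed in this guide):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leaf_values (object_id INTEGER, attribute TEXT, value TEXT)")

query = """
SELECT l1.object_id FROM leaf_values l1
JOIN leaf_values l2 ON l1.object_id = l2.object_id
WHERE l1.attribute = 'attrib1' AND l1.value = 'value1'
  AND l2.attribute = 'attrib2' AND l2.value = 'value2'
"""

def plan(q):
    # Column 3 of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return " | ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + q))

before = plan(query)   # at least one side is a full table scan
conn.execute(
    "CREATE INDEX leaf_attr_val ON leaf_values(attribute, value, object_id)")
after = plan(query)    # both sides now search the compound index
```

Without the index the plan contains a SCAN step (possibly paired with an automatic transient index); with it, both table instances become SEARCH steps over leaf_attr_val.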
Troubleshooting Steps, Solutions & Fixes
1. Implementing Hash-Based Text Deduplication Without FTS Indexes
Step 1: Separate Text Storage from FTS
Store raw texts in a dedicated table with hash-based deduplication:
CREATE TABLE text_store (
    id INTEGER PRIMARY KEY,
    sip_hash BLOB UNIQUE, -- Collision-resistant hash (e.g., SipHash-2-4)
    content BLOB          -- Compressed text; SQLite has no COMPRESSED column
                          -- type, so compress via an extension or in the app
);
CREATE VIRTUAL TABLE text_fts USING fts4(content);
- Insertion Workflow:
  1. Compute sip_hash of the input text.
  2. Attempt to insert (sip_hash, content) into text_store, relying on UNIQUE to reject duplicates.
  3. On conflict, retrieve the existing id.
  4. Insert the content into text_fts with the same rowid as text_store.id.
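The workflow above can be sketched in Python. The standard library has no SipHash, so keyed BLAKE2b truncated to 64 bits stands in for it here (an assumption for illustration, not the author's exact choice), and compression is omitted for brevity:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE text_store (
    id INTEGER PRIMARY KEY,
    sip_hash BLOB UNIQUE,
    content TEXT
);
CREATE VIRTUAL TABLE text_fts USING fts4(content);
""")

def text_hash(text):
    # Stand-in for SipHash: keyed BLAKE2b with a 64-bit digest.
    return hashlib.blake2b(text.encode(), digest_size=8, key=b"demo-key").digest()

def insert_text(text):
    """Insert text at most once; return the canonical text_store.id either way."""
    h = text_hash(text)
    try:
        cur = conn.execute(
            "INSERT INTO text_store (sip_hash, content) VALUES (?, ?)", (h, text))
        rowid = cur.lastrowid
        # Mirror into FTS under the same rowid so ids line up across tables.
        conn.execute(
            "INSERT INTO text_fts (rowid, content) VALUES (?, ?)", (rowid, text))
        return rowid
    except sqlite3.IntegrityError:
        # Duplicate hash: the text is already stored; reuse its id.
        return conn.execute(
            "SELECT id FROM text_store WHERE sip_hash = ?", (h,)).fetchone()[0]
```

Inserting the same text twice returns the same id without growing either table.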
Step 2: Leverage SQLite’s Compression Extensions
Use a zlib- or zstd-based loadable extension to compress text_store.content. Note that siphash_128, sqlite_compress, and sqlite_decompress below are not built into SQLite; they are assumed to be provided by loadable extensions or application-defined functions:
-- Compress on insert
INSERT INTO text_store (sip_hash, content)
VALUES (siphash_128(?), sqlite_compress(?));
-- Decompress on select
SELECT sqlite_decompress(content) FROM text_store WHERE id = ?;
This reduces storage for long texts while maintaining FTS performance, as text_fts contains uncompressed tokens.
Step 3: Hash Collision Handling
SipHash produces a 64-bit output by default, so collisions are improbable for small datasets but become a real concern as the corpus grows (the birthday bound puts noticeable risk near 2^32 entries). For very large archives:
- Use a 128-bit output hash (e.g., the SipHash-2-4-128 variant).
- Add a collision check during insertion:
-- Insert only if no row already stores this exact text; a UNIQUE
-- violation on sip_hash afterwards signals a genuine collision.
INSERT INTO text_store (sip_hash, content)
SELECT ? AS sip_hash, sqlite_compress(?) AS content
WHERE NOT EXISTS (
    SELECT 1 FROM text_store
    WHERE sip_hash = ? AND sqlite_decompress(content) = ?
);
2. Optimizing Path Storage for B-Tree and FTS Efficiency
Step 1: Hybrid Encoding for Path Components
Encode path components as integers stored in a lookup table, but retain human-readable aliases for queries:
CREATE TABLE path_components (
id INTEGER PRIMARY KEY,
name TEXT UNIQUE -- e.g., "message", "media"
);
-- Example path: message.media.photo → 1.5.12
CREATE TABLE object_paths (
object_id INTEGER,
path_ids TEXT, -- Dot-separated component IDs
    path_text TEXT,        -- Denormalized human-readable path, kept for debugging
FOREIGN KEY (object_id) REFERENCES objects(id)
);
CREATE VIRTUAL TABLE path_fts USING fts4(path_text);
- Tradeoff: path_ids minimizes storage and B-tree depth, while path_text (populated via a JOIN with path_components) aids in debugging.
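The hybrid encoding can be sketched in Python against the schema above (encode_path and decode_path are illustrative helper names, not part of SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE path_components (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

def encode_path(path):
    """Turn 'message.media.photo' into a dot-separated ID string like '1.2.3',
    registering unseen components in path_components on the fly."""
    ids = []
    for name in path.split("."):
        row = conn.execute(
            "SELECT id FROM path_components WHERE name = ?", (name,)).fetchone()
        if row is None:
            row = (conn.execute(
                "INSERT INTO path_components (name) VALUES (?)",
                (name,)).lastrowid,)
        ids.append(row[0])
    return ".".join(str(i) for i in ids)

def decode_path(path_ids):
    """Inverse mapping, for debugging: '1.2.3' back to 'message.media.photo'."""
    names = [conn.execute("SELECT name FROM path_components WHERE id = ?",
                          (int(cid),)).fetchone()[0]
             for cid in path_ids.split(".")]
    return ".".join(names)
```

Components are assigned small sequential integers, so common prefixes such as "message" encode identically across all paths that share them.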
Step 2: Tokenizer Configuration for Encoded Paths
Configure the tokenizer so path_ids splits into numeric components (drop and recreate path_fts if it was already created above):
-- unicode61 already treats '.' as a separator and digits as token
-- characters; stating the option explicitly documents the intent
CREATE VIRTUAL TABLE path_fts USING fts4(
    path_ids,
    tokenize=unicode61 "separators=."
);
This allows FTS queries like:
SELECT object_id FROM path_fts WHERE path_ids MATCH '1 5'; -- Components 1 AND 5
Step 3: B-Tree Depth Mitigation
- Shorter Keys: Using INTEGER identifiers instead of TEXT reduces index key size. A 4-byte integer can represent about 4 billion unique components.
- Prefix Clustering: For path_ids stored as TEXT, use a fixed-width encoding (e.g., zero-padded CAST(id AS TEXT)) or a custom collation (ICU collations via SQLITE_ENABLE_ICU are one option) so numeric components sort consistently and paths sharing a prefix cluster on the same index pages.
Benchmark with EXPLAIN QUERY PLAN to compare index usage:
EXPLAIN QUERY PLAN
SELECT object_id FROM object_paths WHERE path_ids LIKE '1.5.%';
3. Optimizing EAV Queries with Self-Joins and Compound Indexes
Step 1: Index Strategy for Leaf Values
Create a covering index for attribute-value lookups:
CREATE INDEX leaf_attr_val ON leaf_values(attribute, value, object_id);
This allows index-only scans for queries filtering on attribute and value.
Step 2: Query Rewriting for Intersection
Instead of self-joins, use INTERSECT for AND conditions:
SELECT object_id FROM leaf_values WHERE attribute = 'a' AND value = '1'
INTERSECT
SELECT object_id FROM leaf_values WHERE attribute = 'b' AND value = '2';
- Advantage: Each subquery uses the leaf_attr_val index, and INTERSECT eliminates duplicates in memory.
- Disadvantage: Manual query construction; consider dynamic SQL generation.
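The dynamic SQL generation can be sketched as follows (build_intersect_query is a hypothetical helper; parameters stay bound, so the attribute names and values never touch the SQL string):

```python
import sqlite3

def build_intersect_query(filters):
    """Build an INTERSECT chain over leaf_values for {attribute: value} filters."""
    part = ("SELECT object_id FROM leaf_values "
            "WHERE attribute = ? AND value = ?")
    sql = "\nINTERSECT\n".join([part] * len(filters))
    params = [x for pair in filters.items() for x in pair]
    return sql, params

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE leaf_values (object_id INTEGER, attribute TEXT, value TEXT);
CREATE INDEX leaf_attr_val ON leaf_values(attribute, value, object_id);
INSERT INTO leaf_values VALUES
    (1, 'attrib1', 'value1'), (1, 'attrib2', 'value2'),
    (2, 'attrib1', 'value1'), (3, 'attrib2', 'value2');
""")

sql, params = build_intersect_query({"attrib1": "value1", "attrib2": "value2"})
matches = [r[0] for r in conn.execute(sql, params)]  # only object 1 has both
```

Each arm of the generated compound SELECT is an index-only probe of leaf_attr_val, so the cost grows with the number of filters rather than the table size.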
Step 3: Materialized Attribute Views
For frequently filtered attributes, create materialized views:
CREATE TABLE object_attr_a (
object_id INTEGER PRIMARY KEY,
value INTEGER,
FOREIGN KEY (object_id) REFERENCES objects(id)
);
CREATE INDEX attr_a_value ON object_attr_a(value);
- Tradeoff: Increases write overhead but enables single-table queries for common attributes.
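One way to keep such a table current is a pair of triggers on leaf_values; a sketch for attribute 'a' (trigger names are illustrative, and an UPDATE trigger would be needed as well for a complete setup):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE leaf_values (object_id INTEGER, attribute TEXT, value TEXT);
CREATE TABLE object_attr_a (object_id INTEGER PRIMARY KEY, value TEXT);

-- Mirror inserts of attribute 'a' into the materialized table.
CREATE TRIGGER leaf_a_ins AFTER INSERT ON leaf_values
WHEN NEW.attribute = 'a'
BEGIN
    INSERT OR REPLACE INTO object_attr_a (object_id, value)
    VALUES (NEW.object_id, NEW.value);
END;

-- Mirror deletes as well, so the table never goes stale.
CREATE TRIGGER leaf_a_del AFTER DELETE ON leaf_values
WHEN OLD.attribute = 'a'
BEGIN
    DELETE FROM object_attr_a WHERE object_id = OLD.object_id;
END;
""")

conn.execute("INSERT INTO leaf_values VALUES (1, 'a', '42')")
conn.execute("INSERT INTO leaf_values VALUES (1, 'b', 'x')")  # ignored by triggers
```

The triggers make the write overhead explicit: every insert or delete touching attribute 'a' performs one extra row operation.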
Step 4: Leverage ANALYZE for Skip-Scan
Run ANALYZE to collect the statistics that enable the skip-scan optimization:
ANALYZE;
Verify statistics with:
SELECT * FROM sqlite_stat1 WHERE tbl = 'leaf_values';
Note that skip-scan is considered only when the leading column (attribute) has relatively few distinct values: SQLite applies it when a typical value of the leading column matches many rows (roughly 18 or more on average, per the statistics in the stat column).
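The stat column encodes the total row count followed by the average number of rows matching each index-column prefix. A quick way to inspect it (the numbers assume the synthetic data below):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE leaf_values (object_id INTEGER, attribute TEXT, value TEXT);
CREATE INDEX leaf_attr_val ON leaf_values(attribute, value, object_id);
""")
# 300 rows spread over only 3 attributes: the shape that favors skip-scan.
conn.executemany("INSERT INTO leaf_values VALUES (?, ?, ?)",
                 [(i, "attr%d" % (i % 3), str(i)) for i in range(300)])
conn.execute("ANALYZE")

stat = conn.execute(
    "SELECT stat FROM sqlite_stat1 WHERE idx = 'leaf_attr_val'").fetchone()[0]
# e.g. '300 100 1 1': 300 rows, ~100 per attribute, ~1 per (attribute, value)
```

A large second number (rows per distinct attribute) is exactly the signal that makes the planner consider skipping through attribute values.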
Step 5: Partial Indexes for Sparse Attributes
For rarely used attributes, use partial indexes:
CREATE INDEX leaf_attr_c ON leaf_values(object_id)
WHERE attribute = 'c' AND value > 100;
Reduces index size and write latency for infrequently queried attributes; note that a query must restate the partial index's WHERE terms (or terms that imply them) for the planner to use it.
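Whether the partial index is actually chosen can be checked with EXPLAIN QUERY PLAN; a sketch where the query's predicate exactly matches the index's WHERE clause:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE leaf_values (object_id INTEGER, attribute TEXT, value INTEGER);
CREATE INDEX leaf_attr_c ON leaf_values(object_id)
    WHERE attribute = 'c' AND value > 100;
INSERT INTO leaf_values VALUES (1, 'c', 150), (2, 'c', 50), (3, 'b', 200);
""")

query = "SELECT object_id FROM leaf_values WHERE attribute = 'c' AND value > 100"
# The plan detail should mention leaf_attr_c when the predicate implies
# the index's WHERE clause.
plan = " | ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + query))
matching = [r[0] for r in conn.execute(query)]  # only object 1 qualifies
```

Because the index WHERE clause implies the query predicate, the planner can satisfy the query from the small partial index alone.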
By addressing these areas with a combination of schema redesign, query rewriting, and strategic indexing, developers can implement efficient NoSQL-like features in SQLite while respecting its inherent constraints.