FTS5 Performance Degradation with “ver1.1” Queries vs. “ver1”
Understanding FTS5 Query Performance Differences Between Versioned Search Terms
The core issue revolves around significant performance degradation when executing FTS5 full-text search queries containing version strings like "ver1.1" compared to simpler patterns like "ver1". In one observed scenario, query execution time increased by 1,000x when using the "ver1.1" pattern. This problem arises from fundamental differences in how FTS5 tokenizes search terms, processes prefix queries, and manages memory during query execution. The observed behavior is not a bug but a consequence of FTS5’s architecture interacting with specific query patterns.
FTS5 operates using inverted indexes that map tokens to their locations within indexed documents. The default unicode61 tokenizer splits text at word boundaries and punctuation, treating periods (`.`) as separator characters. When searching for "ver1.1", the tokenizer parses it as two separate tokens: "ver1" and "1". The query implicitly translates to `"ver1" + "1*"`, where `+` represents a phrase search requiring both tokens to appear in sequence. The asterisk in `1*` triggers a prefix query that must scan all tokens starting with "1", creating massive computational overhead when dealing with large datasets.
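This tokenization behavior can be observed directly. The sketch below (Python's stdlib `sqlite3`, assuming a build with FTS5 enabled; table and column names are made up for illustration) shows that a quoted `"ver1.1"` query also matches a document that merely contains the tokens "ver1" and "1" in sequence:

```python
import sqlite3

# Demonstrates unicode61 splitting "ver1.1" into two tokens.
# Assumes the underlying SQLite build has FTS5 enabled.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE t USING fts5(body)")
db.executemany(
    "INSERT INTO t(body) VALUES (?)",
    [("release ver1.1 notes",),   # the literal version string
     ("ver1 1 spare parts",),     # the two tokens, adjacent
     ("ver1 only",)],             # first token only
)
# '"ver1.1"' is tokenized into the phrase "ver1" followed by "1", so it
# also matches the second row, where the tokens merely appear in sequence.
rows = [r[0] for r in db.execute("SELECT body FROM t WHERE t MATCH '\"ver1.1\"'")]
```

The unexpected match on the second row is exactly the phrase-search translation described above.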
Key performance metrics from SQLite's `-stats` output reveal critical differences:
- Memory Allocation: The "ver1.1" query allocated 268,435,456 bytes vs. 87,360 bytes for "ver1"
- Pcache Overflow: 2,022,120 bytes vs. 340,440 bytes
- Page Cache Misses: 95,276 vs. 97
These metrics indicate that the prefix query on "1*" forces FTS5 to load enormous doclists (document identifier lists) into memory, overwhelming SQLite’s page cache and requiring excessive heap allocations. The problem compounds when:
- The prefix term ("1") appears frequently in indexed content
- No prefix indexes exist to optimize lookup
- Tokenization rules split version strings into suboptimal components
Root Causes of FTS5 Slowdown with Versioned Search Patterns
1. Tokenization Rules Splitting Version Identifiers
The default Unicode61 tokenizer treats periods as separator characters, splitting "ver1.1" into "ver1" and "1". This forces compound queries that require both exact matches and prefix searches. Three factors exacerbate this:
- Phrase Proximity Requirements: The `+` operator mandates that "ver1" and a token starting with "1" appear consecutively
- Unoptimized Prefix Scans: Without prefix indexes, FTS5 must scan all tokens matching `1*` across the entire corpus
- Doclist Bloat: Frequent prefix terms generate massive doclists that exceed available memory buffers
2. Inefficient Handling of Mixed Exact/Prefix Queries
The query `col1 MATCH 42 AND my_fts5 MATCH '("ver1.1"*)'` combines a numeric match with a complex full-text operation. FTS5 processes these as sequential filters:
- Execute `col1 MATCH 42` to find candidate rows
- For each candidate, verify `my_fts5 MATCH '("ver1.1"*)'`
The second step becomes a bottleneck because:
- The "ver1.1" query cannot leverage indexes until after filtering by `col1`
- Intermediate results aren't cached between query plan stages
- Large doclists from `1*` require repeated scans during row verification
3. Memory Management Under High-Pressure Scenarios
SQLite’s memory statistics reveal critical pressure points:
- Largest Allocation: 256MB for "ver1.1" vs 85KB for "ver1"
- Pcache Overflow: Indicates page cache couldn’t hold frequently accessed data
- Lookaside Failures: 445 OOM errors suggest memory fragmentation
These occur because:
- FTS5 loads entire doclists for prefix terms into contiguous memory blocks
- SQLite’s default page cache (controlled at compile time by `SQLITE_DEFAULT_PCACHE_INITSZ`) isn't sized for doclists hundreds of megabytes in size
- Repeated allocations and frees for large doclists fragment the heap
Optimizing FTS5 Schema and Query Execution for Version Strings
1. Prefix Index Optimization
Solution: Add prefix indexes to accelerate `1*`-style queries

```sql
CREATE VIRTUAL TABLE my_fts5 USING fts5(
  content_rowid='id',
  content='my',
  prefix='1',  -- single-character prefix index
  /* other columns */
);
```
Mechanics:
- Creates supplemental indexes for 1-character token prefixes
- Reduces doclist scanning by pre-indexing initial characters
- Tradeoff: 5-10% storage overhead per prefix length
Verification: monitor `pcache` metrics after implementation (note the phrase must be quoted, since an unquoted `.` is a syntax error in FTS5 query strings):

```shell
sqlite3 -stats db.sqlite "SELECT * FROM my_fts5 WHERE my_fts5 MATCH '\"ver1.1\"*'"
```

Expect reductions in:
- Largest Allocation (target <100MB)
- Page cache misses (target <1,000)
Caveats:
- Each additional prefix length (e.g. `prefix='1,2'`) adds a separate supplemental index, multiplying storage cost
- Rebuild required after schema modification
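As a quick functional check of the option, the sketch below (Python, illustrative table names, FTS5-enabled build assumed) creates a one-character prefix index and runs the problematic prefix query against it:

```python
import sqlite3

# prefix='1' builds a supplemental index over 1-character token prefixes,
# so the trailing * is answered from that index rather than a scan of
# every token beginning with "1". Names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE docs USING fts5(body, prefix='1')")
db.executemany("INSERT INTO docs(body) VALUES (?)",
               [("ver1.1 build 12",), ("ver2.0 build 7",)])
# Quoting the phrase and appending * makes the final token a prefix query.
hits = [r[0] for r in db.execute(
    "SELECT body FROM docs WHERE docs MATCH '\"ver1.1\"*'")]
```

The query result is unchanged by the prefix index; only the lookup cost differs, which is why the `-stats` comparison above is the real verification.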
2. Tokenizer Customization for Version Recognition
Solution: Treat periods as token characters to preserve version strings
```sql
CREATE VIRTUAL TABLE my_fts5 USING fts5(
  content_rowid='id',
  content='my',
  tokenize="unicode61 tokenchars '.'",
  /* other columns */
);
```
Impact:
- "ver1.1" becomes a single token instead of ["ver1", "1"]
- Queries for "ver1.1*" scan a smaller, targeted doclist
- Version comparisons become exact matches rather than phrase searches
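The effect of `tokenchars` is easy to confirm. In this sketch (Python, illustrative names, FTS5-enabled build assumed), the same query that previously matched split-token documents now matches only the literal version string:

```python
import sqlite3

# With tokenchars '.', "ver1.1" is indexed as one token, so the query no
# longer matches documents that merely contain "ver1" followed by "1".
db = sqlite3.connect(":memory:")
db.execute('''CREATE VIRTUAL TABLE docs USING fts5(
                  body, tokenize="unicode61 tokenchars '.'")''')
db.executemany("INSERT INTO docs(body) VALUES (?)",
               [("release ver1.1 notes",), ("ver1 1 spare parts",)])
exact = [r[0] for r in db.execute(
    "SELECT body FROM docs WHERE docs MATCH '\"ver1.1\"'")]
```

Because the query string is tokenized by the same tokenizer as the documents, the quoted `"ver1.1"` now resolves to a single-token lookup.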
Testing Protocol:
- Rebuild the FTS5 table with the new tokenizer
- Execute comparative queries:

```sql
-- Old behavior (split tokens)
EXPLAIN QUERY PLAN
SELECT * FROM my_fts5 WHERE my_fts5 MATCH 'ver1 + 1*';

-- New behavior (single token)
EXPLAIN QUERY PLAN
SELECT * FROM my_fts5 WHERE my_fts5 MATCH '"ver1.1"*';
```

Validate that the second query's plan reports a narrow FTS5 index lookup rather than a broad scan.
Backward Compatibility:
- Existing queries using "ver1" without ".1" will require wildcards: `ver1*`
- Phrase searches across version components become impossible
3. Custom Tokenizer Implementation
For granular control over version parsing, implement a tokenizer in C. The sketch below is illustrative, not production code: `is_version_token()` is a hypothetical helper, the default-tokenization fallback is omitted, and a real tokenizer must supply working `xCreate`/`xDelete` callbacks. Note that FTS5 tokenizers are registered through the `fts5_api` pointer (obtained via `SELECT fts5(?1)`), not a standalone registration function:

```c
#include "sqlite3ext.h"
SQLITE_EXTENSION_INIT1
#include "fts5.h"

static int is_version_token(const char *z, int n);  /* illustrative, not shown */

static int xTokenize(
  Fts5Tokenizer *pTokenizer,
  void *pCtx,
  int flags,
  const char *pText, int nText,
  int (*xToken)(void*, int, const char*, int, int, int)
){
  int i = 0;
  while( i<nText ){
    int nTok = is_version_token(&pText[i], nText-i);
    if( nTok>0 ){
      /* Emit e.g. "ver1.1" as a single token spanning bytes [i, i+nTok) */
      int rc = xToken(pCtx, 0, &pText[i], nTok, i, i+nTok);
      if( rc!=SQLITE_OK ) return rc;
      i += nTok;
    }else{
      i++;  /* fall back to default tokenization (omitted) */
    }
  }
  return SQLITE_OK;
}

/* Obtain the fts5_api pointer from a database handle. */
static fts5_api *fts5_api_from_db(sqlite3 *db){
  fts5_api *pApi = 0;
  sqlite3_stmt *pStmt = 0;
  if( sqlite3_prepare_v2(db, "SELECT fts5(?1)", -1, &pStmt, 0)==SQLITE_OK ){
    sqlite3_bind_pointer(pStmt, 1, (void*)&pApi, "fts5_api_ptr", 0);
    sqlite3_step(pStmt);
  }
  sqlite3_finalize(pStmt);
  return pApi;
}

int sqlite3_my_tokenizer_init(
  sqlite3 *db,
  char **pzErrMsg,
  const sqlite3_api_routines *pRoutines
){
  SQLITE_EXTENSION_INIT2(pRoutines);
  /* fts5_tokenizer member order is {xCreate, xDelete, xTokenize};
  ** a real implementation must provide all three. */
  static fts5_tokenizer t = {0, 0, xTokenize};
  fts5_api *pFts5 = fts5_api_from_db(db);
  if( pFts5==0 ) return SQLITE_ERROR;
  return pFts5->xCreateTokenizer(pFts5, "version_tokenizer", 0, &t, 0);
}
```
Integration:

```sql
CREATE VIRTUAL TABLE my_fts5 USING fts5(
  content_rowid='id',
  content='my',
  tokenize='version_tokenizer',
  /* other columns */
);
```
Advantages:
- Exact version matching without wildcards
- No doclist bloat from partial prefixes
- Custom logic for semantic versioning (e.g., "ver1.1.2-beta")
Disadvantages:
- Requires native code compilation
- Maintenance overhead for tokenizer updates
4. Query Rewriting and Data Preprocessing
For environments where schema changes are impractical:
Step 1: Normalize version strings during ingestion
```sql
-- Add preprocessed column
ALTER TABLE my ADD COLUMN version_clean TEXT;

-- Normalization trigger (SQLite cannot assign to NEW in a BEFORE
-- trigger, so use AFTER INSERT with an UPDATE)
CREATE TRIGGER version_normalize AFTER INSERT ON my BEGIN
  UPDATE my SET version_clean = REPLACE(NEW.version, '.', '_')
  WHERE rowid = NEW.rowid;
END;

-- FTS5 population
INSERT INTO my_fts5(rowid, version_clean)
SELECT rowid, version_clean FROM my;
```
Step 2: Rewrite queries to use normalized forms. Note that the default unicode61 tokenizer also treats `_` as a separator, so the FTS5 table must declare `tokenchars '_'` for `ver1_1` to survive as a single token:

```sql
SELECT * FROM my_fts5
WHERE my_fts5 MATCH 'ver1_1*'
AND col1 = 42;
```
Step 3: Map results back to original versions

```python
# Post-processing in application code
results = execute_sql("...")
for row in results:
    row['version'] = row['version_clean'].replace('_', '.')
```
Performance Gains:
- Eliminates prefix scans by using exact matches
- Reduces doclist sizes through controlled tokenization
- Avoids schema changes by leveraging application-layer processing
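The three steps above can be exercised end to end. This sketch (Python, illustrative schema, FTS5-enabled build assumed) normalizes on insert via a trigger, indexes the normalized column with `tokenchars '_'`, and maps results back in the application:

```python
import sqlite3

# End-to-end sketch of the normalize-on-ingest approach. Caveat: the
# default unicode61 tokenizer also treats '_' as a separator, so the
# FTS5 table declares tokenchars '_' to keep "ver1_1" as one token.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE my (id INTEGER PRIMARY KEY, version TEXT, version_clean TEXT)")
db.execute("""CREATE TRIGGER version_normalize AFTER INSERT ON my BEGIN
                  UPDATE my SET version_clean = REPLACE(NEW.version, '.', '_')
                  WHERE id = NEW.id;
              END""")
db.execute('''CREATE VIRTUAL TABLE my_fts5 USING fts5(
                  version_clean, content='my', content_rowid='id',
                  tokenize="unicode61 tokenchars '_'")''')
db.execute("INSERT INTO my(version) VALUES ('ver1.1')")
db.execute("INSERT INTO my_fts5(rowid, version_clean) SELECT id, version_clean FROM my")
found = [r[0] for r in db.execute(
    "SELECT version_clean FROM my_fts5 WHERE my_fts5 MATCH 'ver1_1*'")]
versions = [v.replace('_', '.') for v in found]  # map back in the application
```

Underscores are legal in FTS5 barewords, so `ver1_1*` needs no quoting in the query string.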
5. Memory Configuration Tuning
When large doclists are unavoidable, adjust SQLite’s memory parameters:
Increase Page Cache Size (this must be configured before `sqlite3_initialize()` is called):

```c
/* Preallocate a ~100MB page cache: 25,000 slots of 4,096 bytes each.
   In practice the slot size must also cover SQLite's per-page header. */
sqlite3_config(SQLITE_CONFIG_PAGECACHE, malloc(1024 * 1024 * 100), 4096, 25000);
```
Adjust Lookaside Allocation (lookaside is configured through the C API, not a PRAGMA):

```c
/* 500 slots of 8,000 bytes each (~4MB); pass NULL to let SQLite allocate */
sqlite3_db_config(db, SQLITE_DBCONFIG_LOOKASIDE, NULL, 8000, 500);
```

Monitor memory behavior with the shell's `.stats on` command, or programmatically via `sqlite3_status()` (e.g. `SQLITE_STATUS_MEMORY_USED`) and `sqlite3_db_status()` (e.g. `SQLITE_DBSTATUS_LOOKASIDE_USED`).
Tradeoffs:
- Higher memory consumption per connection
- Risk of OOM errors in memory-constrained environments
6. Hybrid Indexing Strategies
Combine FTS5 with traditional indexes for compound queries:
Index on Numeric Filter:

```sql
CREATE INDEX idx_my_col1 ON my(col1);
```
Revised Query Plan:

```sql
SELECT * FROM my
WHERE rowid IN (
  SELECT rowid FROM my_fts5 WHERE my_fts5 MATCH '"ver1.1"*'
)
AND col1 = 42;
```
Execution Flow:
- Use the FTS5 index to find "ver1.1*" matches
- Use the B-tree index on `col1` to filter results
- Join via rowid intersection
Optimization Insights:
- Reduces full-text search result set early
- Leverages rowid joins for fast intersections
- Requires covering indexes on both FTS5 and relational columns
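The rowid-intersection pattern can be sketched as follows (Python, illustrative schema, FTS5-enabled build assumed):

```python
import sqlite3

# FTS5 narrows candidates first; the B-tree index on col1 then filters
# them, joined via rowid. Table and column names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE my (id INTEGER PRIMARY KEY, col1 INT, body TEXT)")
db.execute("CREATE INDEX idx_my_col1 ON my(col1)")
db.execute("CREATE VIRTUAL TABLE my_fts5 USING fts5(body, content='my', content_rowid='id')")
db.executemany("INSERT INTO my(id, col1, body) VALUES (?, ?, ?)",
               [(1, 42, "ver1.1 build"), (2, 7, "ver1.1 hotfix"), (3, 42, "ver2.0 build")])
db.execute("INSERT INTO my_fts5(rowid, body) SELECT id, body FROM my")
result = db.execute("""
    SELECT id FROM my
    WHERE rowid IN (SELECT rowid FROM my_fts5 WHERE my_fts5 MATCH '"ver1.1"*')
      AND col1 = 42""").fetchall()
```

Only the row matching both the full-text phrase and the numeric filter survives the intersection.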
7. Progressive Materialized Views
For frequently searched version patterns:
Materialize Common Queries:

```sql
CREATE TABLE mat_ver1_1 AS
SELECT rowid, * FROM my_fts5 WHERE my_fts5 MATCH '"ver1.1"*';

CREATE INDEX idx_mat_ver1_1 ON mat_ver1_1(col1);
```
Query Routing:

```python
def search_ver1_1(col1_val):
    if is_common_query(col1_val):
        return execute("SELECT * FROM mat_ver1_1 WHERE col1=?", col1_val)
    else:
        return execute("SELECT * FROM my_fts5 ...")
```
Refresh Strategy:

```sql
CREATE TRIGGER mat_refresh AFTER UPDATE ON my BEGIN
  DELETE FROM mat_ver1_1 WHERE rowid = OLD.rowid;
  -- NEW.* cannot be inserted directly; re-select the updated row
  INSERT INTO mat_ver1_1
  SELECT rowid, * FROM my_fts5
  WHERE rowid = NEW.rowid AND my_fts5 MATCH '"ver1.1"*';
END;
```
Considerations:
- Ideal for read-heavy workloads
- Requires trigger maintenance
- Storage overhead proportional to materialized data
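A minimal sketch of the materialization itself (Python, illustrative names, FTS5-enabled build assumed; the refresh trigger is omitted for brevity):

```python
import sqlite3

# Materialize one hot full-text query into a plain table and index the
# relational filter column for fast repeated lookups.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE my (id INTEGER PRIMARY KEY, col1 INT, body TEXT)")
db.execute("CREATE VIRTUAL TABLE my_fts5 USING fts5(body, content='my', content_rowid='id')")
db.executemany("INSERT INTO my(id, col1, body) VALUES (?, ?, ?)",
               [(1, 42, "ver1.1 build"), (2, 7, "ver2.0 build")])
db.execute("INSERT INTO my_fts5(rowid, body) SELECT id, body FROM my")
db.execute("""CREATE TABLE mat_ver1_1 AS
              SELECT m.id, m.col1 FROM my m
              WHERE m.id IN (SELECT rowid FROM my_fts5
                             WHERE my_fts5 MATCH '"ver1.1"*')""")
db.execute("CREATE INDEX idx_mat_ver1_1 ON mat_ver1_1(col1)")
fast = db.execute("SELECT id FROM mat_ver1_1 WHERE col1 = 42").fetchall()
```

Subsequent reads hit only the small materialized table and its B-tree index, never the FTS5 structures.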
8. Version-Specific Partitioning
Split data by major versions using partial indexes:
Table Partitioning:

```sql
CREATE TABLE my_ver1 (
  id INTEGER PRIMARY KEY,
  col1 INT,
  ...
);

CREATE VIRTUAL TABLE my_fts5_ver1 USING fts5(...);
```
Query Routing (the MATCH on the partition's FTS5 table is joined in via subquery):

```sql
SELECT * FROM my_ver1
WHERE col1 = 42
  AND id IN (SELECT rowid FROM my_fts5_ver1
             WHERE my_fts5_ver1 MATCH '"ver1.1"*');
```
Benefits:
- Reduces FTS5 index size per partition
- Enables version-specific optimizations
- Simplifies archival of obsolete versions
9. Statistical Profiling with EXPLAIN
Analyze query plans to identify bottlenecks:
Execution Plan Analysis:

```sql
EXPLAIN QUERY PLAN
SELECT * FROM my_fts5
WHERE col1 = 42 AND my_fts5 MATCH '"ver1.1"*';
```
Interpretation:
- `SCAN TABLE my_fts5 VIRTUAL TABLE INDEX 0:M` indicates a full-text scan
- `SEARCH TABLE my USING INDEX idx_col1` shows B-tree index usage
- `USE TEMP B-TREE FOR ORDER BY` suggests inefficient sorting
Optimization Targets:
- Eliminate temporary B-trees through index-only scans
- Convert SCAN to SEARCH in FTS5 using prefix indexes
- Reorder WHERE clauses to leverage indexed columns first
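Query plans can also be inspected programmatically. The sketch below (Python, illustrative names) reads `EXPLAIN QUERY PLAN` rows; the exact detail text varies between SQLite versions, so check only coarse markers such as SEARCH vs SCAN:

```python
import sqlite3

# Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail);
# join the detail strings and look for coarse markers.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE my (id INTEGER PRIMARY KEY, col1 INT)")
db.execute("CREATE INDEX idx_my_col1 ON my(col1)")
plan = db.execute("EXPLAIN QUERY PLAN SELECT * FROM my WHERE col1 = 42").fetchall()
detail = " ".join(row[3] for row in plan)
```

Seeing `SEARCH` with the named index confirms the B-tree is used; a `SCAN` here would indicate the filter is not indexed.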
10. FTS5 External Content Optimization
For large content tables, offload storage:
Schema Configuration:

```sql
CREATE VIRTUAL TABLE my_fts5 USING fts5(
  content_rowid='id',
  content='my',
  prefix='1',
  tokenize="unicode61"
);
```
Content Synchronization:

```sql
-- Rebuild the full-text index from the external content table
INSERT INTO my_fts5(my_fts5) VALUES('rebuild');
```
Behavior:
- FTS5 stores only the token index, not the original document text
- Original data is retrieved via the `content='my'` table join
- Reduces FTS5 storage overhead substantially (often cited at 40-60%)
Tradeoffs:
- Requires strict trigger-based synchronization between `my` and `my_fts5`
- `rebuild` operations are expensive on large tables
- Fully contentless tables (`content=''`) save even more space, but cannot return stored column values
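A minimal external-content setup with the `'rebuild'` command (Python, illustrative names, FTS5-enabled build assumed; the triggers needed for ongoing synchronization are omitted):

```python
import sqlite3

# External-content FTS5: the index stores only tokens; row content is
# fetched from table `my` by rowid at query time.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE my (id INTEGER PRIMARY KEY, body TEXT)")
db.execute("CREATE VIRTUAL TABLE my_fts5 USING fts5(body, content='my', content_rowid='id')")
db.execute("INSERT INTO my(body) VALUES ('ver1.1 release notes')")
# 'rebuild' repopulates the entire index from the content table in one pass.
db.execute("INSERT INTO my_fts5(my_fts5) VALUES ('rebuild')")
hit = db.execute("SELECT body FROM my_fts5 WHERE my_fts5 MATCH '\"ver1.1\"'").fetchone()
```

Selecting `body` through `my_fts5` transparently reads the text back from the external `my` table.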
Final Recommendations
- Immediate Mitigation: Implement prefix indexes with `prefix='1'` and monitor `pcache` metrics
- Medium-Term Solution: A custom tokenizer (or `tokenchars '.'`) treating '.' as a token character
- Long-Term Strategy: Version-aware schema partitioning with materialized views

Benchmark each approach using real-world data volumes and query patterns. Use SQLite's `-stats` output to validate memory-usage reductions and query-plan improvements. For mission-critical systems, combine FTS5 optimizations with memory-configuration tuning to handle large doclists efficiently.