FTS5 Performance Degradation with “ver1.1” Queries vs. “ver1”

Understanding FTS5 Query Performance Differences Between Versioned Search Terms

The core issue revolves around significant performance degradation when executing FTS5 full-text search queries containing version strings like "ver1.1" compared to simpler patterns like "ver1". In one observed scenario, query execution time increased by 1,000x when using the "ver1.1" pattern. This problem arises from fundamental differences in how FTS5 tokenizes search terms, processes prefix queries, and manages memory during query execution. The observed behavior is not a bug but a consequence of FTS5’s architecture interacting with specific query patterns.

FTS5 operates using inverted indexes that map tokens to their locations within indexed documents. The default Unicode61 tokenizer splits text at word boundaries and punctuation marks, treating periods (.) as separator characters. When searching for "ver1.1", the tokenizer parses this as two separate tokens: "ver1" and "1". The query syntax implicitly translates to "ver1" + "1*", where + represents a phrase search requiring both tokens to appear in sequence. The asterisk in "1*" triggers a prefix query that must scan all tokens starting with "1", creating massive computational overhead when dealing with large datasets.
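The split is easy to observe from Python's built-in sqlite3 module via an fts5vocab table, which exposes the terms FTS5 actually indexed (a sketch; assumes the underlying SQLite build has FTS5 enabled, and the `docs` table name is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE VIRTUAL TABLE docs USING fts5(body);
    -- fts5vocab exposes the vocabulary of an FTS5 index
    CREATE VIRTUAL TABLE docs_vocab USING fts5vocab(docs, 'row');
    INSERT INTO docs(body) VALUES ('release ver1.1 notes');
""")
tokens = sorted(row[0] for row in conn.execute("SELECT term FROM docs_vocab"))
print(tokens)  # the period acts as a separator: '1' and 'ver1' are distinct terms
```

The vocabulary contains `1` and `ver1` as separate entries; `ver1.1` never appears as a single term.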

Key performance metrics from SQLite’s -stats output reveal critical differences:

  • Memory Allocation: The "ver1.1" query allocated 268,435,456 bytes vs. 87,360 bytes for "ver1"
  • Pcache Overflow: 2,022,120 bytes vs. 340,440 bytes
  • Page Cache Misses: 95,276 vs. 97

These metrics indicate that the prefix query on "1*" forces FTS5 to load enormous doclists (document identifier lists) into memory, overwhelming SQLite’s page cache and requiring excessive heap allocations. The problem compounds when:

  1. The prefix term ("1") appears frequently in indexed content
  2. No prefix indexes exist to optimize lookup
  3. Tokenization rules split version strings into suboptimal components

Root Causes of FTS5 Slowdown with Versioned Search Patterns

1. Tokenization Rules Splitting Version Identifiers

The default Unicode61 tokenizer treats periods as separator characters, splitting "ver1.1" into "ver1" and "1". This forces compound queries that require both exact matches and prefix searches. Three factors exacerbate this:

  • Phrase Proximity Requirements: The + operator mandates that "ver1" and tokens starting with "1" appear consecutively
  • Unoptimized Prefix Scans: Without prefix indexes, FTS5 must scan all tokens matching "1*" across the entire corpus
  • Doclist Bloat: Frequent prefix terms generate massive doclists that exceed available memory buffers
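The phrase equivalence can be checked directly: a quoted "ver1.1" query and an explicit `ver1 + 1` phrase return identical rows (Python sketch assuming an FTS5-enabled build; table contents are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE VIRTUAL TABLE docs USING fts5(body);
    INSERT INTO docs(body) VALUES ('ver1.1 released'), ('ver1 only'), ('chapter 1');
""")
# Quoted "ver1.1" is tokenized at query time into the phrase "ver1" + "1"
a = [r[0] for r in conn.execute("SELECT rowid FROM docs WHERE docs MATCH '\"ver1.1\"'")]
b = [r[0] for r in conn.execute("SELECT rowid FROM docs WHERE docs MATCH 'ver1 + 1'")]
print(a, b)  # both match only the row where the two tokens appear consecutively
```

Neither `ver1 only` nor `chapter 1` matches, because the phrase requires the tokens `ver1` and `1` to be adjacent.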

2. Inefficient Handling of Mixed Exact/Prefix Queries

The query col1 = 42 AND my_fts5 MATCH '("ver1.1"*)' combines a numeric filter with a complex full-text operation. SQLite processes these as sequential filters:

  1. Execute col1 = 42 to find candidate rows
  2. For each candidate, verify my_fts5 MATCH '("ver1.1"*)'

The second step becomes a bottleneck because:

  • The "ver1.1" query cannot leverage indexes until after filtering by col1
  • Intermediate results aren’t cached between query plan stages
  • Large doclists from "1*" require repeated scans during row verification

3. Memory Management Under High-Pressure Scenarios

SQLite’s memory statistics reveal critical pressure points:

  • Largest Allocation: 256MB for "ver1.1" vs 85KB for "ver1"
  • Pcache Overflow: Indicates page cache couldn’t hold frequently accessed data
  • Lookaside Failures: 445 failed lookaside allocations fell back to the general heap, a symptom of small-allocation churn and fragmentation

These occur because:

  • FTS5 loads entire doclists for prefix terms into contiguous memory blocks
  • SQLite’s default page cache (controlled by -DSQLITE_DEFAULT_PCACHE_INITSZ) isn’t sized for terabyte-scale doclists
  • Repeated allocations/frees for large doclists fragment the heap

Optimizing FTS5 Schema and Query Execution for Version Strings

1. Prefix Index Optimization

Solution: Add prefix indexes to accelerate "1*"-style queries

CREATE VIRTUAL TABLE my_fts5 USING fts5(
  content_rowid='id', 
  content='my',
  prefix='1',  -- Single-character prefix index
  /* other columns */
);

Mechanics:

  • Creates supplemental indexes for 1-character token prefixes
  • Reduces doclist scanning by pre-indexing initial characters
  • Tradeoff: 5-10% storage overhead per prefix length

Verification:
Monitor pcache metrics after implementation:

sqlite3 -stats db.sqlite "SELECT * FROM my_fts5 WHERE my_fts5 MATCH '\"ver1.1\"*'"

Expect reductions in:

  • Largest Allocation (target <100MB)
  • Page cache misses (target <1,000)

Caveats:

  • Multiple prefix lengths (prefix='1,2') each add a full supplemental index, growing storage roughly linearly per length
  • Rebuild required after schema modification
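A minimal sketch of the schema change from Python (assumes FTS5 is compiled in; a standalone table is used here to sidestep the external-content rebuild for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- prefix='1' builds a supplemental index over 1-character token prefixes,
    -- so the "1*" half of the phrase no longer scans the whole term list
    CREATE VIRTUAL TABLE docs USING fts5(body, prefix='1');
    INSERT INTO docs(body) VALUES ('ver1.1 released'), ('version 2.0');
""")
hits = [r[0] for r in conn.execute(
    "SELECT rowid FROM docs WHERE docs MATCH '\"ver1.1\"*'")]
print(hits)  # same answer as without the prefix index; only the lookup cost changes
```

The prefix index is transparent to query results; its effect shows up in the -stats memory and page-cache figures, not in the rows returned.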

2. Tokenizer Customization for Version Recognition

Solution: Treat periods as token characters to preserve version strings

CREATE VIRTUAL TABLE my_fts5 USING fts5(
  content_rowid='id',
  content='my',
  tokenize="unicode61 tokenchars '.'",
  /* other columns */
);

Impact:

  • "ver1.1" becomes a single token instead of ["ver1", "1"]
  • Queries for "ver1.1*" scan a smaller, targeted doclist
  • Version comparisons become exact matches rather than phrase searches
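The effect is visible through an fts5vocab table (sketch; assumes an FTS5-enabled Python build):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE VIRTUAL TABLE docs USING fts5(body, tokenize="unicode61 tokenchars '.'");
    CREATE VIRTUAL TABLE docs_vocab USING fts5vocab(docs, 'row');
    INSERT INTO docs(body) VALUES ('ver1.1 released');
""")
terms = sorted(r[0] for r in conn.execute("SELECT term FROM docs_vocab"))
print(terms)  # 'ver1.1' survives as a single indexed term
hits = [r[0] for r in conn.execute(
    "SELECT rowid FROM docs WHERE docs MATCH '\"ver1.1\"'")]
print(hits)
```

With the period declared as a token character, the query is a single-term lookup instead of a two-token phrase with a prefix scan.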

Testing Protocol:

  1. Rebuild FTS5 table with new tokenizer
  2. Execute comparative queries:
    -- Old behavior (split tokens)
    EXPLAIN QUERY PLAN 
    SELECT * FROM my_fts5 WHERE my_fts5 MATCH 'ver1 + 1*';
    
    -- New behavior (single token)
    EXPLAIN QUERY PLAN
    SELECT * FROM my_fts5 WHERE my_fts5 MATCH 'ver1.1*';
    

Note that EXPLAIN QUERY PLAN always reports virtual-table access as SCAN ... VIRTUAL TABLE INDEX regardless of the internal strategy, so validate the improvement by comparing run times and -stats memory figures rather than the plan text alone.

Backward Compatibility:

  • Existing queries using "ver1" without ".1" will require wildcards: ver1*
  • Phrase searches across version components become impossible

3. Custom Tokenizer Implementation

For granular control over version parsing, implement a C/C++ tokenizer:

#include <string.h>
#include "sqlite3.h"
#include "fts5.h"

/* Sketch only: emit "verN.N"-shaped substrings as single tokens. */
static int xCreate(void *pUnused, const char **azArg, int nArg, Fts5Tokenizer **ppOut){
  (void)pUnused; (void)azArg; (void)nArg;
  *ppOut = 0;  /* this sketch keeps no per-tokenizer state */
  return SQLITE_OK;
}
static void xDelete(Fts5Tokenizer *pTok){ (void)pTok; /* nothing to free */ }

static int xTokenize(
  Fts5Tokenizer *pTokenizer,
  void *pCtx,
  int flags,
  const char *pText, int nText,
  int (*xToken)(void*, int, const char*, int, int, int)
){
  /* Walk the input, emitting version strings as single tokens and
  ** falling back to default word splitting elsewhere (omitted). */
  int i = 0;
  while( i<nText ){
    if( nText-i>=6 && memcmp(&pText[i], "ver", 3)==0 ){
      /* token text, byte length, and start/end offsets into pText */
      int rc = xToken(pCtx, 0, &pText[i], 6, i, i+6);
      if( rc!=SQLITE_OK ) return rc;
      i += 6;
    }else{
      i++;  /* default tokenization of other runs omitted for brevity */
    }
  }
  return SQLITE_OK;
}

/* Registration goes through the fts5_api object, obtained with the
** "fts5" pointer-passing interface; there is no fts5_register_tokenizer(). */
static fts5_api *fts5_api_from_db(sqlite3 *db){
  fts5_api *pRet = 0;
  sqlite3_stmt *pStmt = 0;
  if( sqlite3_prepare_v2(db, "SELECT fts5(?1)", -1, &pStmt, 0)==SQLITE_OK ){
    sqlite3_bind_pointer(pStmt, 1, (void*)&pRet, "fts5_api_ptr", 0);
    sqlite3_step(pStmt);
  }
  sqlite3_finalize(pStmt);
  return pRet;
}

int sqlite3_my_tokenizer_init(sqlite3 *db){
  static fts5_tokenizer t = { xCreate, xDelete, xTokenize };
  fts5_api *pApi = fts5_api_from_db(db);
  if( pApi==0 ) return SQLITE_ERROR;
  return pApi->xCreateTokenizer(pApi, "version_tokenizer", 0, &t, 0);
}

Integration:

CREATE VIRTUAL TABLE my_fts5 USING fts5(
  content_rowid='id',
  content='my',
  tokenize='version_tokenizer',
  /* other columns */
);

Advantages:

  • Exact version matching without wildcards
  • No doclist bloat from partial prefixes
  • Custom logic for semantic versioning (e.g., "ver1.1.2-beta")

Disadvantages:

  • Requires native code compilation
  • Maintenance overhead for tokenizer updates

4. Query Rewriting and Data Preprocessing

For environments where schema changes are impractical:

Step 1: Normalize version strings during ingestion

-- Add preprocessed column
ALTER TABLE my ADD COLUMN version_clean TEXT;

-- Normalization trigger (SQLite triggers have no "SET NEW.x" syntax)
CREATE TRIGGER version_normalize AFTER INSERT ON my BEGIN
  UPDATE my SET version_clean = REPLACE(NEW.version, '.', '_')
  WHERE rowid = NEW.rowid;
END;

-- FTS5 population; note that unicode61 also treats '_' as a separator
-- by default, so the FTS5 table needs tokenize="unicode61 tokenchars '_'"
INSERT INTO my_fts5(rowid, version_clean) 
SELECT rowid, version_clean FROM my;

Step 2: Rewrite queries to use normalized forms

SELECT * FROM my_fts5 
WHERE my_fts5 MATCH 'ver1_1*' 
  AND col1 = 42;

Step 3: Map results back to original versions

# Post-processing in application code
results = execute_sql("...")
for row in results:
    row['version'] = row['version_clean'].replace('_', '.')

Performance Gains:

  • Eliminates prefix scans by using exact matches
  • Reduces doclist sizes through controlled tokenization
  • Avoids schema changes by leveraging application-layer processing
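The three steps above can be exercised end to end (Python sketch; table and column names follow the snippets above and are illustrative — note the tokenchars '_' setting, since unicode61 would otherwise split on underscores too):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my (id INTEGER PRIMARY KEY, version TEXT, version_clean TEXT);
    -- Step 1: normalize '.' to '_' on ingestion
    CREATE TRIGGER version_normalize AFTER INSERT ON my BEGIN
        UPDATE my SET version_clean = REPLACE(NEW.version, '.', '_')
        WHERE id = NEW.id;
    END;
    CREATE VIRTUAL TABLE my_fts5 USING fts5(
        version_clean, content='my', content_rowid='id',
        tokenize="unicode61 tokenchars '_'"
    );
    INSERT INTO my(version) VALUES ('ver1.1'), ('ver1');
    INSERT INTO my_fts5(rowid, version_clean) SELECT id, version_clean FROM my;
""")
# Step 2: query the normalized form as a single exact token
rows = [r[0] for r in conn.execute(
    "SELECT rowid FROM my_fts5 WHERE my_fts5 MATCH 'ver1_1'")]
# Step 3: map results back to the original version string
versions = [v.replace('_', '.') for (v,) in conn.execute(
    "SELECT version_clean FROM my WHERE id = ?", (rows[0],))]
print(rows, versions)
```

Only the `ver1.1` row matches; the row containing plain `ver1` is untouched because the query is now a single-token exact match, not a phrase-plus-prefix search.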

5. Memory Configuration Tuning

When large doclists are unavoidable, adjust SQLite’s memory parameters:

Increase Page Cache Size (must run before sqlite3_initialize()):

sqlite3_config(SQLITE_CONFIG_PAGECACHE, malloc(1024 * 1024 * 100), 4096, 25000);

Reserves a 100MB buffer for up to 25,000 pages of 4KB each; in production, add the per-page header bytes reported by SQLITE_CONFIG_PCACHE_HDRSZ to the slot size.

Adjust Lookaside Allocation (per connection; there is no PRAGMA for this, use the C API):

sqlite3_db_config(db, SQLITE_DBCONFIG_LOOKASIDE, NULL, 8000, 500);  /* 500 slots x 8KB = 4MB */

Monitor Memory Usage:

.stats on                                        -- sqlite3 shell, per-statement stats
sqlite3_status(SQLITE_STATUS_MEMORY_USED, &cur, &hi, 0);  /* C API */

Tradeoffs:

  • Higher memory consumption per connection
  • Risk of OOM errors in memory-constrained environments

6. Hybrid Indexing Strategies

Combine FTS5 with traditional indexes for compound queries:

Index on Numeric Filter:

CREATE INDEX idx_my_col1 ON my(col1);

Revised Query Plan:

SELECT * FROM my 
WHERE rowid IN (
  SELECT rowid FROM my_fts5 WHERE my_fts5 MATCH 'ver1.1*'
) 
AND col1 = 42;

Execution Flow:

  1. Use FTS5 index to find "ver1.1*" matches
  2. Use B-tree index on col1 to filter results
  3. Join via rowid intersection

Optimization Insights:

  • Reduces full-text search result set early
  • Leverages rowid joins for fast intersections
  • Requires covering indexes on both FTS5 and relational columns
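A runnable sketch of the rowid-intersection plan (Python; the schema is illustrative and assumes FTS5 support):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my (id INTEGER PRIMARY KEY, col1 INT, body TEXT);
    CREATE INDEX idx_my_col1 ON my(col1);
    CREATE VIRTUAL TABLE my_fts5 USING fts5(body, content='my', content_rowid='id');
    INSERT INTO my(col1, body) VALUES
        (42, 'ver1.1 released'), (42, 'unrelated row'), (7, 'ver1.1 beta');
    INSERT INTO my_fts5(rowid, body) SELECT id, body FROM my;
""")
# FTS5 narrows the candidate set first; the B-tree index finishes the filter
rows = [r[0] for r in conn.execute(
    "SELECT id FROM my "
    "WHERE id IN (SELECT rowid FROM my_fts5 WHERE my_fts5 MATCH '\"ver1.1\"*') "
    "  AND col1 = 42")]
print(rows)
```

The `ver1.1 beta` row passes the full-text filter but is dropped by the col1 condition, and `unrelated row` never leaves the FTS5 stage.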

7. Progressive Materialized Views

For frequently searched version patterns:

Materialize Common Queries:

-- col1 lives in the base table, so materialize from my, not my_fts5
CREATE TABLE mat_ver1_1 AS
SELECT m.rowid AS rowid, m.*
FROM my m
WHERE m.rowid IN (SELECT rowid FROM my_fts5 WHERE my_fts5 MATCH '"ver1.1"*');

CREATE INDEX idx_mat_ver1_1 ON mat_ver1_1(col1);

Query Routing:

def search_ver1_1(col1_val):
    if is_common_query(col1_val):
        return execute("SELECT * FROM mat_ver1_1 WHERE col1=?", col1_val)
    else:
        return execute("SELECT * FROM my_fts5 ...")

Refresh Strategy:

CREATE TRIGGER mat_refresh AFTER UPDATE ON my BEGIN
  DELETE FROM mat_ver1_1 WHERE rowid = OLD.rowid;
  INSERT INTO mat_ver1_1
  SELECT m.rowid, m.* FROM my m
  WHERE m.rowid = NEW.rowid
    AND m.rowid IN (SELECT rowid FROM my_fts5 WHERE my_fts5 MATCH '"ver1.1"*');
END;

Considerations:

  • Ideal for read-heavy workloads
  • Requires trigger maintenance
  • Storage overhead proportional to materialized data
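A condensed sketch of the materialization step (Python; col1 is pulled from the base table because it is not an FTS5 column, and all names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my (id INTEGER PRIMARY KEY, col1 INT, body TEXT);
    INSERT INTO my(col1, body) VALUES (42, 'ver1.1 released'), (7, 'unrelated');
    CREATE VIRTUAL TABLE my_fts5 USING fts5(body, content='my', content_rowid='id');
    INSERT INTO my_fts5(rowid, body) SELECT id, body FROM my;
    -- Materialize the hot query once; later reads skip FTS5 entirely
    CREATE TABLE mat_ver1_1 AS
        SELECT id, col1 FROM my
        WHERE id IN (SELECT rowid FROM my_fts5 WHERE my_fts5 MATCH '"ver1.1"*');
    CREATE INDEX idx_mat_ver1_1 ON mat_ver1_1(col1);
""")
rows = [r[0] for r in conn.execute("SELECT id FROM mat_ver1_1 WHERE col1 = 42")]
print(rows)
```

Subsequent lookups on the materialized table are ordinary B-tree searches on idx_mat_ver1_1 with no full-text machinery involved.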

8. Version-Specific Partitioning

Split data by major versions using partial indexes:

Table Partitioning:

CREATE TABLE my_ver1 (
  id INTEGER PRIMARY KEY,
  col1 INT,
  ...
);

CREATE VIRTUAL TABLE my_fts5_ver1 USING fts5(...);

Query Routing:

SELECT * FROM my_ver1 
WHERE col1 = 42 
  AND rowid IN (
    SELECT rowid FROM my_fts5_ver1 WHERE my_fts5_ver1 MATCH '"ver1.1"*'
  );

Benefits:

  • Reduces FTS5 index size per partition
  • Enables version-specific optimizations
  • Simplifies archival of obsolete versions

9. Statistical Profiling with EXPLAIN

Analyze query plans to identify bottlenecks:

Execution Plan Analysis:

EXPLAIN QUERY PLAN
SELECT * FROM my_fts5 
WHERE col1 = 42 AND my_fts5 MATCH '"ver1.1"*';

Interpretation:

  • SCAN TABLE my_fts5 VIRTUAL TABLE INDEX 0:M indicates full-text scan
  • SEARCH TABLE my USING INDEX idx_col1 shows B-tree usage
  • USE TEMP B-TREE FOR ORDER BY suggests inefficient sorting

Optimization Targets:

  • Eliminate temporary B-trees through index-only scans
  • Convert SCAN to SEARCH in FTS5 using prefix indexes
  • Reorder WHERE clauses to leverage indexed columns first
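Plan inspection can be scripted (Python sketch; the exact detail strings vary between SQLite versions, but virtual-table access always reports VIRTUAL TABLE INDEX):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE VIRTUAL TABLE my_fts5 USING fts5(body);
    INSERT INTO my_fts5(body) VALUES ('ver1.1 released');
""")
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM my_fts5 WHERE my_fts5 MATCH '\"ver1.1\"*'"
).fetchall()
for row in plan:
    print(row[-1])  # detail column, e.g. SCAN my_fts5 VIRTUAL TABLE INDEX 0:...
```

Each plan row's last column is the human-readable detail string; grepping it for VIRTUAL TABLE INDEX confirms the FTS5 index is driving the query.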

10. FTS5 External Content Optimization

For large content tables, offload storage:

Schema Configuration:

CREATE VIRTUAL TABLE my_fts5 USING fts5(
  content_rowid='id',
  content='my',
  prefix='1',
  tokenize="unicode61"
);

Content Synchronization:

-- Rebuild the full-text index from the external content table
INSERT INTO my_fts5(my_fts5) VALUES('rebuild');

Behavior:

  • FTS5 stores only tokens, not original content
  • Original data retrieved via content='my' join
  • Reduces FTS5 storage overhead by 40-60%

Tradeoffs:

  • Requires strict trigger-based synchronization
  • rebuild operations are expensive
  • Returning original column values requires a lookup in the external content table (fully contentless tables, declared with content='', save even more space but cannot return stored text)

Final Recommendations

  1. Immediate Mitigation: Implement prefix indexes with prefix=1 and monitor pcache metrics
  2. Medium-Term Solution: Custom tokenizer treating ‘.’ as token character
  3. Long-Term Strategy: Version-aware schema partitioning with materialized views

Benchmark each approach using real-world data volumes and query patterns. Use SQLite’s -stats output to validate memory usage reductions and query plan improvements. For mission-critical systems, combine FTS5 optimizations with memory configuration tuning to handle large doclists efficiently.
