FTS5 Synonym Handling: Dynamic Queries, Tokenizer Customization, and Phrase Matching
Issue Overview: FTS5 Synonym Configuration Challenges in Dynamic Environments
The core challenge revolves around effectively implementing synonym support within SQLite’s FTS5 extension when dealing with dynamic synonym lists that require real-time updates without full-text index rebuilds. This scenario presents three interconnected complexities:
Documentation Ambiguity in Synonym Method Specification
The official FTS5 documentation originally contained an inverted reference to synonym handling methods (2) and (3), potentially leading developers to misapply tokenization strategies. While this has been corrected, residual confusion persists regarding method selection criteria.
Dynamic Synonym Management Constraints
Systems requiring on-the-fly synonym updates face architectural limitations with traditional FTS5 approaches. Method 1 (synonym folding during indexing) and Method 3 (document-side synonym expansion) mandate full index rebuilds for synonym changes, making them unsuitable for live environments. This pushes implementers toward Method 2 (query-side expansion), which requires customizing the tokenization pipeline.
Phrase Query Compatibility with Colocated Tokens
Quoted phrase searches ("bsd history") demand special handling when using colocated tokens (FTS5_TOKEN_COLOCATED), as standard OR-based query rewriting fails within phrase boundaries. The solution requires precise tokenizer modification to maintain phrase integrity while expanding synonyms.
Possible Causes: Architectural Limitations and Tokenization Pipeline Design
1. Index-Time vs Query-Time Synonym Expansion Tradeoffs
- Method 1 (Index-Time Folding): Irreversible token substitution during indexing optimizes storage but permanently loses original token information. Any synonym list change invalidates existing indexes.
- Method 3 (Document-Side Expansion): Stores multiple token variants but requires complete document reindexing when synonyms change. Suitable for static synonym lists only.
- Method 2 (Query-Side Expansion): Preserves original document tokens while expanding query terms. Requires custom tokenizer implementation but allows dynamic synonym updates.
2. Tokenizer Inheritance Challenges
Custom tokenizers wrapping the unicode61 base must carefully manage:
- Flag propagation (FTS5_TOKENIZE_QUERY/DOCUMENT)
- Callback chaining between wrapper and base tokenizers
- Proper handling of colocation markers in query mode
- Version compatibility with future SQLite releases
3. Phrase Boundary Limitations in Standard Query Syntax
FTS5’s query parser treats quoted phrases as exact token sequences. Traditional OR-based expansion:
"bsd OR freebsd history" -- Not expanded: inside quotes, OR is tokenized as an ordinary word
(bsd OR freebsd) history -- Valid query, but an implicit AND rather than a phrase
cannot express alternatives at a single phrase position, necessitating tokenizer-level intervention to maintain phrase continuity across synonyms.
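The reason tokenizer-level expansion keeps phrases intact can be modeled by treating a phrase as a sequence of positions, each holding one or more acceptable terms. The sketch below is a toy model of that idea only, not FTS5's actual matcher; `PhrasePos`, `pos_accepts`, and `phrase_matches` are names invented for illustration.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ALTS 4

/* One phrase position: the original term plus any colocated synonyms. */
typedef struct PhrasePos {
    const char *alts[MAX_ALTS];  /* acceptable terms at this position */
    int nAlt;                    /* number of entries used in alts[] */
} PhrasePos;

/* Return 1 if term is one of the alternatives at this position. */
static int pos_accepts(const PhrasePos *p, const char *term) {
    for (int i = 0; i < p->nAlt; i++) {
        if (strcmp(p->alts[i], term) == 0) return 1;
    }
    return 0;
}

/* Return 1 if the document tokens contain the phrase at consecutive
   positions, where each position may match any of its alternatives. */
static int phrase_matches(const char **doc, int nDoc,
                          const PhrasePos *phrase, int nPhrase) {
    if (nPhrase <= 0) return 0;
    for (int start = 0; start + nPhrase <= nDoc; start++) {
        int k = 0;
        while (k < nPhrase && pos_accepts(&phrase[k], doc[start + k])) k++;
        if (k == nPhrase) return 1;
    }
    return 0;
}
```

With the phrase modeled as {bsd, freebsd} followed by {history}, a document containing "freebsd history" matches, while one containing "freebsd and history" does not, because adjacency is still enforced.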
Troubleshooting Steps, Solutions & Fixes: Implementing Robust Dynamic Synonym Handling
Phase 1: Tokenizer Customization for Method 2 Implementation
Step 1.1: Create Tokenizer Wrapper Architecture
Implement a custom tokenizer that proxies requests to unicode61 while injecting synonyms:
typedef struct SynonymTokenizer {
  fts5_tokenizer base;              // unicode61 methods (xCreate, xDelete, xTokenize)
  Fts5Tokenizer *pUnicodeTokenizer; // unicode61 instance created through base.xCreate
  SynonymMap *pSynonymMap;          // Your dynamic synonym store
} SynonymTokenizer;
Step 1.2: Handle Tokenization Modes
Differentiate document vs query processing using FTS5_TOKENIZE_* flags:
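int xTokenize(
  Fts5Tokenizer *pTok,            // this wrapper's instance, as returned by its xCreate
  void *pCtx,                     // opaque context to hand back to xToken
  int flags,                      // mask of FTS5_TOKENIZE_* flags
  const char *pText, int nText,
  int (*xToken)(void*, int, const char*, int, int, int)
) {
  SynonymTokenizer *p = (SynonymTokenizer*)pTok;
  // Proxy to the base tokenizer first. In query mode, synonym_query_callback
  // (a helper of this sketch) collects each token together with pCtx
  // instead of emitting it.
  int rc = p->base.xTokenize(p->pUnicodeTokenizer, pCtx, flags, pText, nText,
    (flags & FTS5_TOKENIZE_QUERY) ? synonym_query_callback : xToken
  );
  if( rc==SQLITE_OK && (flags & FTS5_TOKENIZE_QUERY) ){
    // Replay the collected tokens to xToken, adding synonyms
    inject_synonyms(p->pSynonymMap, xToken);
  }
  return rc;
}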
Step 1.3: Implement Colocation Marking
During query tokenization, append synonyms with FTS5_TOKEN_COLOCATED:
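void inject_synonyms(SynonymMap *pMap, Fts5TokenCallback xToken) {
  // Fts5TokenCallback, Token, Synonym, get_collected_tokens() and
  // find_synonyms() are helpers of this sketch, not part of the FTS5 API.
  int rc = SQLITE_OK;
  for(Token *t = get_collected_tokens(); t && rc==SQLITE_OK; t=t->next) {
    // Emit original token (tflags is 0, not a byte offset)
    rc = xToken(t->pUser, 0, t->pText, t->nText, t->iStart, t->iEnd);
    // Emit synonyms as colocated, reusing the original byte offsets
    for(Synonym *s = find_synonyms(pMap, t); s && rc==SQLITE_OK; s=s->next) {
      rc = xToken(t->pUser,
        FTS5_TOKEN_COLOCATED,
        s->pTerm, s->nTerm,
        t->iStart, t->iEnd
      );
    }
  }
}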
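An alternative to collecting tokens and replaying them is to wrap the xToken callback itself, so that each synonym is emitted immediately after its original token. The following is a minimal sketch of that streaming variant under stated assumptions: `SynEntry`, `SynCtx`, `synonym_token_cb`, `Sink`, and `sink_cb` are hypothetical names, the synonym table is a fixed array rather than a dynamic store, and FTS5_TOKEN_COLOCATED is redefined locally (with the value used in fts5.h) only so the sketch compiles without the SQLite headers.

```c
#include <string.h>

#ifndef FTS5_TOKEN_COLOCATED
#define FTS5_TOKEN_COLOCATED 0x0001  /* same value as in fts5.h */
#endif

/* Same shape as the xToken callback that FTS5 passes to xTokenize. */
typedef int (*TokenCb)(void *pCtx, int tflags, const char *pToken,
                       int nToken, int iStart, int iEnd);

/* Hypothetical synonym entry; a real store would be dynamic. */
typedef struct SynEntry { const char *term; const char *syn; } SynEntry;

/* Context handed to the base tokenizer in place of the caller's pCtx. */
typedef struct SynCtx {
    void *pDownCtx;        /* caller's original context */
    TokenCb xDown;         /* caller's original xToken */
    const SynEntry *aSyn;  /* synonym table */
    int nSyn;              /* number of entries in aSyn */
} SynCtx;

/* Forward the token unchanged, then emit each synonym as a colocated
   token with the same byte offsets. Stops at the first non-zero rc. */
static int synonym_token_cb(void *pCtx, int tflags, const char *pToken,
                            int nToken, int iStart, int iEnd) {
    SynCtx *p = (SynCtx *)pCtx;
    int rc = p->xDown(p->pDownCtx, tflags, pToken, nToken, iStart, iEnd);
    for (int i = 0; rc == 0 && i < p->nSyn; i++) {
        const char *t = p->aSyn[i].term;
        /* pToken is not assumed to be nul-terminated: compare by length. */
        if ((int)strlen(t) == nToken && memcmp(t, pToken, (size_t)nToken) == 0) {
            const char *s = p->aSyn[i].syn;
            rc = p->xDown(p->pDownCtx, FTS5_TOKEN_COLOCATED,
                          s, (int)strlen(s), iStart, iEnd);
        }
    }
    return rc;
}

/* Demo sink standing in for FTS5: records each token and its flags. */
typedef struct Sink { int n; int flags[8]; char text[8][16]; } Sink;

static int sink_cb(void *pCtx, int tflags, const char *pToken,
                   int nToken, int iStart, int iEnd) {
    Sink *s = (Sink *)pCtx;
    (void)iStart; (void)iEnd;
    if (s->n >= 8 || nToken >= 16) return 1;  /* demo capacity exceeded */
    memcpy(s->text[s->n], pToken, (size_t)nToken);
    s->text[s->n][nToken] = '\0';
    s->flags[s->n++] = tflags;
    return 0;
}
```

In a real wrapper, xTokenize would pass a `SynCtx` and `synonym_token_cb` to the base tokenizer only when FTS5_TOKENIZE_QUERY is set, and pass the caller's pCtx and xToken through untouched otherwise. This avoids buffering tokens, at the cost of doing the synonym lookup inside the callback.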
Step 1.4: Handle Token Position Tracking
Maintain accurate token positions for highlighting compatibility:
- Capture byte offsets from base tokenizer
- Reuse original positions for colocated synonyms
- Adjust highlight handling routines to recognize colocated tokens as position-equivalent
Phase 2: Dynamic Synonym Management Integration
Step 2.1: Implement Synonym Storage Layer
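Store synonyms in an ordinary table so that they can be updated at runtime without touching the FTS index:
CREATE TABLE dynamic_synonyms(
  base_term TEXT PRIMARY KEY,
  synonyms TEXT -- JSON array or comma-separated
);
-- Optional: Use an FTS5 vocabulary table for pattern matching against indexed terms.
-- The first argument names the FTS5 table; the "row" type yields one row per distinct term.
CREATE VIRTUAL TABLE vocab USING fts5vocab(fts_table, row);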
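If the synonyms column holds a comma-separated list, the tokenizer side needs to split it into individual terms before emitting them. A minimal sketch follows, assuming plain comma separators with optional surrounding spaces; `split_synonyms` is a hypothetical helper, and a JSON-array encoding would need a different parser.

```c
#include <string.h>

/* Split a comma-separated list in place. Writes up to nMax pointers into
   out[] and returns the number of non-empty items found. The input buffer
   is modified: commas and trailing spaces become terminators. */
static int split_synonyms(char *list, char **out, int nMax) {
    int n = 0;
    char *p = list;
    while (p && *p && n < nMax) {
        char *comma = strchr(p, ',');
        if (comma) *comma = '\0';
        while (*p == ' ') p++;                 /* skip leading spaces */
        char *end = p + strlen(p);
        while (end > p && end[-1] == ' ') {    /* trim trailing spaces */
            *--end = '\0';
        }
        if (*p) out[n++] = p;                  /* ignore empty items */
        p = comma ? comma + 1 : NULL;
    }
    return n;
}
```

The returned pointers refer into the caller's buffer, so the buffer must outlive them; copying the column text before splitting keeps the value fetched from SQLite untouched.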
Step 2.2: Pattern-Based Synonym Expansion
Combine FTS5 vocabulary tables with LIKE expressions for advanced matching:
SELECT term FROM vocab
WHERE term LIKE (SELECT synonym_pattern FROM synonym_triggers WHERE base_term=?);
Step 2.3: Cache Management Strategies
- Use sqlite3_update_hook() to monitor synonym table changes (the hook only reports changes made through the connection it is registered on)
- Implement LRU caching for frequent synonym lookups
- Utilize partial indexes for active synonym subsets
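One way to connect change monitoring to the cache is a generation counter that the update hook bumps and the tokenizer checks before each lookup. The sketch below assumes the synonym table is named dynamic_synonyms as above; `SynCache`, `synonym_update_hook`, and `cache_is_stale` are hypothetical names. The callback mirrors the parameter list expected by sqlite3_update_hook(), with the rowid written as long long instead of sqlite3_int64 so the sketch compiles without sqlite3.h.

```c
#include <string.h>

/* Hypothetical cache state shared between the hook and the tokenizer. */
typedef struct SynCache {
    unsigned generation;  /* bumped whenever the synonym table changes */
    unsigned loaded;      /* generation the in-memory map was built from */
} SynCache;

/* Update-hook callback: marks the cache stale when a row of the synonym
   table is inserted, updated, or deleted. It only touches the counter and
   does not use the database connection. */
static void synonym_update_hook(void *pArg, int op, const char *zDb,
                                const char *zTable, long long rowid) {
    SynCache *c = (SynCache *)pArg;
    (void)op; (void)zDb; (void)rowid;
    if (zTable && strcmp(zTable, "dynamic_synonyms") == 0) {
        c->generation++;
    }
}

/* Called before a lookup: returns 1 if the map must be reloaded. */
static int cache_is_stale(const SynCache *c) {
    return c->loaded != c->generation;
}
```

Registration would look like `sqlite3_update_hook(db, synonym_update_hook, &cache)` with a matching cast for the rowid type. After reloading the map, the tokenizer sets `loaded` to the current `generation`. Changes made through other connections are not reported by the hook, so a multi-process deployment would need another signal.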
Phase 3: Phrase Query Handling and Optimization
Step 3.1: Validate Phrase Matching Behavior
Test quoted phrase queries with synonym expansion:
-- Should match both "bsd history" and "freebsd history"
SELECT * FROM fts_table WHERE fts_table MATCH '"bsd history"';
Step 3.2: Analyze Query Execution Plans
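EXPLAIN QUERY PLAN confirms that the query is answered by the FTS5 virtual table, but it does not expose tokenizer output, so synonym expansion is not visible in the plan. Check expansion through the results instead:
EXPLAIN QUERY PLAN
SELECT * FROM fts_table WHERE fts_table MATCH '"bsd history"';
-- Then confirm that a row containing only "freebsd history" is returned
SELECT rowid, highlight(fts_table, 0, '[', ']') FROM fts_table WHERE fts_table MATCH '"bsd history"';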
Step 3.3: Optimize Multi-Term Phrase Performance
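- If prefix queries are common, declare prefix indexes with the FTS5 prefix option when creating the table
- Covering indexes do not apply to FTS5 tables; for frequent phrase patterns, keep synonym lists short, since each colocated synonym adds another term lookup at that position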
- Implement batch processing for bulk synonym updates
Phase 4: Maintenance and Compatibility Assurance
Step 4.1: Versioning and Upgrade Paths
- Check SQLITE_VERSION_NUMBER in tokenizer code
- Define fallback behaviors for deprecated features
- Maintain unit test compatibility matrices across SQLite versions
Step 4.2: Performance Monitoring
Instrument the tokenizer with timing hooks:
#ifdef SYNONYM_DEBUG
clock_t start = clock();
inject_synonyms(...);
log_time(clock() - start);
#endif
Step 4.3: Regression Testing Suite
Create test cases covering:
- Single-term synonym expansion
- Multi-word phrase preservation
- Mixed case and Unicode handling
- Concurrency during synonym updates
- Long-term index stability
Advanced Considerations
Handling Substring Synonyms
For partial term matches (e.g., "bsd" → "freebsd"), implement:
if(strstr(base_term, "bsd")) {
inject_colocated("freebsd");
}
Multi-Token Synonym Injection
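Colocated tokens all occupy a single position, so a single token cannot be expanded into a true multi-word phrase this way. Emitting the words of a multi-word synonym as colocated tokens makes each word an independent alternative at that position:
// Expanding "os" with "operating" and "system" as single-token alternatives
xToken(pCtx, 0, "os", 2, iStart, iEnd);
xToken(pCtx, FTS5_TOKEN_COLOCATED, "operating", 9, iStart, iEnd);
xToken(pCtx, FTS5_TOKEN_COLOCATED, "system", 6, iStart, iEnd);
To match the phrase "operating system" itself, rewrite the query text before MATCH instead, for example: os OR "operating system".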
Weighted Synonym Prioritization
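The xToken flags argument does not carry weights: FTS5_TOKEN_COLOCATED is the only token flag the FTS5 tokenizer API documents, so extra bits such as (0x7F << 8) have no defined meaning. Emit the synonym as a plain colocated token and apply any weighting at ranking time, for example in a custom auxiliary function:
xToken(pCtx, FTS5_TOKEN_COLOCATED, "freebsd", 7, iStart, iEnd);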
This approach supports dynamic synonym handling without index rebuilds, because synonym changes affect only query tokenization and leave the stored index untouched. Regular testing against SQLite updates and continuous monitoring of the tokenization pipeline remain important in production environments.