FTS5 Synonym Handling: Dynamic Queries, Tokenizer Customization, and Phrase Matching

Issue Overview: FTS5 Synonym Configuration Challenges in Dynamic Environments

The core challenge is implementing synonym support in SQLite’s FTS5 extension when the synonym list is dynamic and must be updated in real time without rebuilding the full-text index. Three interconnected complexities arise:

  1. Documentation Ambiguity in Synonym Method Specification
    The official FTS5 documentation originally contained an inverted reference to synonym handling methods (2) and (3), potentially leading developers to misapply tokenization strategies. While this has been corrected, residual confusion persists regarding method selection criteria.

  2. Dynamic Synonym Management Constraints
    Systems requiring on-the-fly synonym updates face architectural limitations with traditional FTS5 approaches. Method 1 (synonym folding during indexing) and Method 3 (document-side synonym expansion) both require a full index rebuild whenever the synonym list changes, making them unsuitable for live environments. This pushes implementers toward Method 2 (query-side expansion), which in turn requires customizing the tokenization pipeline.

  3. Phrase Query Compatibility with Colocated Tokens
    Quoted phrase searches ("bsd history") demand special handling when using colocated tokens (FTS5_TOKEN_COLOCATED), as standard OR-based query rewriting fails within phrase boundaries. The solution requires precise tokenizer modification to maintain phrase integrity while expanding synonyms.

Possible Causes: Architectural Limitations and Tokenization Pipeline Design

1. Index-Time vs Query-Time Synonym Expansion Tradeoffs

  • Method 1 (Index-Time Folding): Irreversible token substitution during indexing optimizes storage but permanently loses original token information. Any synonym list change invalidates existing indexes.
  • Method 3 (Document-Side Expansion): Stores multiple token variants but requires complete document reindexing when synonyms change. Suitable for static synonym lists only.
  • Method 2 (Query-Side Expansion): Preserves original document tokens while expanding query terms. Requires custom tokenizer implementation but allows dynamic synonym updates.

2. Tokenizer Inheritance Challenges
Custom tokenizers wrapping the unicode61 base must carefully manage:

  • Flag propagation (FTS5_TOKENIZE_QUERY/DOCUMENT)
  • Callback chaining between wrapper and base tokenizers
  • Proper handling of colocation markers in query mode
  • Version compatibility with future SQLite releases
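
The flag-propagation point can be sketched as a small predicate. The constants below use the same values as sqlite3.h but are redefined locally so the snippet compiles on its own; should_expand_synonyms is a hypothetical helper, and skipping expansion for prefix queries is an assumption of this sketch, not something FTS5 requires.

```c
#include <stdbool.h>

/* Local copies of the FTS5 tokenize-mode flags so the sketch compiles
** without sqlite3.h. Real code should include sqlite3.h instead. */
#ifndef FTS5_TOKENIZE_QUERY
# define FTS5_TOKENIZE_QUERY    0x0001
# define FTS5_TOKENIZE_PREFIX   0x0002
# define FTS5_TOKENIZE_DOCUMENT 0x0004
# define FTS5_TOKENIZE_AUX      0x0008
#endif

/* Hypothetical helper: decide whether the wrapper should inject synonyms. */
static bool should_expand_synonyms(int flags){
  if( (flags & FTS5_TOKENIZE_QUERY)==0 ) return false; /* document or aux text */
  if( flags & FTS5_TOKENIZE_PREFIX ) return false;     /* assumption: leave prefix terms alone */
  return true;
}
```

Keeping this decision in one place makes it easier to guarantee that document text is always passed through to unicode61 unchanged, which is what allows the synonym list to change without a reindex.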

3. Phrase Boundary Limitations in Standard Query Syntax
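FTS5’s query parser treats a quoted phrase as an exact token sequence, and everything inside the quotes is handed to the tokenizer as plain text. Traditional OR-based expansion therefore does not work inside a phrase:

"bsd OR freebsd history"    -- OR is tokenized as an ordinary word, not an operator
"(bsd OR freebsd) history"  -- Parentheses are discarded; same four-token phrase

Neither form expresses the intended alternatives. The only pure query-syntax workaround is to enumerate every combination ("bsd history" OR "freebsd history"), which grows quickly with phrase length, so tokenizer-level intervention is needed to maintain phrase continuity across synonyms.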

Troubleshooting Steps, Solutions & Fixes: Implementing Robust Dynamic Synonym Handling

Phase 1: Tokenizer Customization for Method 2 Implementation

Step 1.1: Create Tokenizer Wrapper Architecture

Implement a custom tokenizer that proxies requests to unicode61 while injecting synonyms:

typedef struct SynonymTokenizer {
  fts5_tokenizer base;                // unicode61 method table (from fts5_api.xFindTokenizer)
  Fts5Tokenizer *pUnicodeTokenizer;   // unicode61 instance created via base.xCreate
  SynonymMap *pSynonymMap;            // Your dynamic synonym store
} SynonymTokenizer;

Step 1.2: Handle Tokenization Modes

Differentiate document vs query processing using FTS5_TOKENIZE_* flags:

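int xTokenize(
  Fts5Tokenizer *pTok,          // our SynonymTokenizer, as returned by xCreate
  void *pCtx,                   // opaque context, passed back to xToken
  int flags,                    // mask of FTS5_TOKENIZE_* flags
  const char *pText, int nText, 
  int (*xToken)(void*, int, const char*, int, int, int)
) {
  SynonymTokenizer *p = (SynonymTokenizer*)pTok;
  
  // Proxy to base tokenizer first; in query mode the tokens are
  // collected by synonym_query_callback instead of being emitted directly
  int rc = p->base.xTokenize(p->pUnicodeTokenizer, pCtx, flags, pText, nText, 
    (flags & FTS5_TOKENIZE_QUERY) ? synonym_query_callback : xToken
  );
  
  if(rc == SQLITE_OK && (flags & FTS5_TOKENIZE_QUERY)) {
    // Re-emit collected tokens with their synonyms injected
    inject_synonyms(p->pSynonymMap, xToken);
  }
  return rc;
}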
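
The collecting callback referenced above is not shown in this guide. A minimal sketch follows, assuming a simple global linked list; the Token layout, the list globals and free_collected_tokens are hypothetical, and the numeric return codes match SQLITE_OK (0) and SQLITE_NOMEM (7) so the snippet compiles without sqlite3.h.

```c
#include <stdlib.h>
#include <string.h>

/* One collected token; mirrors the arguments of the FTS5 xToken callback. */
typedef struct Token {
  void *pUser;            /* context to hand back to the real xToken later */
  char *pText; int nText; /* private copy of the token text */
  int iStart, iEnd;       /* byte offsets within the query text */
  struct Token *next;
} Token;

static Token *g_head = NULL, *g_tail = NULL;  /* hypothetical collection list */

/* Same signature as xToken: copy the token and append it to the list. */
static int synonym_query_callback(
  void *pCtx, int tflags, const char *pToken, int nToken, int iStart, int iEnd
){
  (void)tflags;
  Token *t = malloc(sizeof(*t));
  if( t==NULL ) return 7;                      /* SQLITE_NOMEM */
  t->pText = malloc((size_t)nToken + 1);
  if( t->pText==NULL ){ free(t); return 7; }
  /* Copy the text: the base tokenizer may reuse its buffer after we return */
  memcpy(t->pText, pToken, (size_t)nToken);
  t->pText[nToken] = '\0';
  t->pUser = pCtx; t->nText = nToken;
  t->iStart = iStart; t->iEnd = iEnd; t->next = NULL;
  if( g_tail ) g_tail->next = t; else g_head = t;
  g_tail = t;
  return 0;                                    /* SQLITE_OK */
}

static Token *get_collected_tokens(void){ return g_head; }

static void free_collected_tokens(void){
  while( g_head ){
    Token *n = g_head->next;
    free(g_head->pText); free(g_head);
    g_head = n;
  }
  g_tail = NULL;
}
```

A global list keeps the sketch short but is not thread-safe; a per-call context struct that carries the list alongside the original pCtx would likely be a safer design in production.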

Step 1.3: Implement Colocation Marking

During query tokenization, append synonyms with FTS5_TOKEN_COLOCATED:

void inject_synonyms(
  SynonymMap *pMap,
  int (*xToken)(void*, int, const char*, int, int, int)
) {
  int rc = SQLITE_OK;
  for(Token *t = get_collected_tokens(); t && rc == SQLITE_OK; t=t->next) {
    // Emit original token; tflags is 0 because it starts a new position
    rc = xToken(t->pUser, 0, t->pText, t->nText, t->iStart, t->iEnd);
    
    // Emit synonyms as colocated, reusing the original byte offsets
    for(Synonym *s = find_synonyms(pMap, t); s && rc == SQLITE_OK; s=s->next) {
      rc = xToken(t->pUser, 
        FTS5_TOKEN_COLOCATED, 
        s->pTerm, s->nTerm, 
        t->iStart, t->iEnd
      );
    }
  }
}
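
To see the emission order in isolation, the following self-contained sketch replaces the dynamic store with a fixed table and records what a callback would receive. The names emit_with_synonyms, SynEntry, Sink and record_token are hypothetical, and FTS5_TOKEN_COLOCATED is redefined locally with the same value as in sqlite3.h.

```c
#include <string.h>

#ifndef FTS5_TOKEN_COLOCATED
# define FTS5_TOKEN_COLOCATED 0x0001  /* same value as in sqlite3.h */
#endif

typedef int (*TokenCb)(void*, int, const char*, int, int, int);

/* Hypothetical fixed synonym table; a real store would be dynamic. */
typedef struct { const char *zTerm; const char *azSyn[3]; } SynEntry;
static const SynEntry aSyn[] = {
  { "bsd", { "freebsd", "openbsd", NULL } },
};

/* Emit one query token followed by its synonyms at the same position. */
static int emit_with_synonyms(
  void *pCtx, TokenCb xToken, const char *zTok, int nTok, int iStart, int iEnd
){
  int rc = xToken(pCtx, 0, zTok, nTok, iStart, iEnd);  /* 0: new position */
  for(size_t i=0; rc==0 && i<sizeof(aSyn)/sizeof(aSyn[0]); i++){
    if( (int)strlen(aSyn[i].zTerm)!=nTok ) continue;
    if( memcmp(aSyn[i].zTerm, zTok, (size_t)nTok)!=0 ) continue;
    for(int j=0; rc==0 && aSyn[i].azSyn[j]; j++){
      const char *z = aSyn[i].azSyn[j];
      rc = xToken(pCtx, FTS5_TOKEN_COLOCATED, z, (int)strlen(z), iStart, iEnd);
    }
  }
  return rc;
}

/* Demo sink: appends "text", or "+text" for colocated tokens, to a buffer. */
typedef struct { char z[128]; } Sink;
static int record_token(
  void *pCtx, int tflags, const char *z, int n, int iStart, int iEnd
){
  Sink *p = (Sink*)pCtx;
  size_t len = strlen(p->z);
  (void)iStart; (void)iEnd;
  if( len + (size_t)n + 3 > sizeof(p->z) ) return 1;   /* buffer full */
  if( len ) p->z[len++] = ' ';
  if( tflags & FTS5_TOKEN_COLOCATED ) p->z[len++] = '+';
  memcpy(p->z + len, z, (size_t)n);
  p->z[len + (size_t)n] = '\0';
  return 0;
}
```

Feeding the tokens bsd and history through emit_with_synonyms with record_token as the sink records each synonym immediately after its original term and before the next position begins, which is the ordering FTS5 expects for colocated tokens.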

Step 1.4: Handle Token Position Tracking

Maintain accurate token positions for highlighting compatibility:

  • Capture byte offsets from base tokenizer
  • Reuse original positions for colocated synonyms
  • Adjust highlight handling routines to recognize colocated tokens as position-equivalent
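
The position rule behind these bullets can be sketched as a tiny counter: a token advances the position unless it is flagged as colocated. PosTracker and next_position are hypothetical names, and treating a colocated flag on the very first token as a new position is a defensive assumption of this sketch.

```c
#ifndef FTS5_TOKEN_COLOCATED
# define FTS5_TOKEN_COLOCATED 0x0001  /* same value as in sqlite3.h */
#endif

/* Hypothetical tracker: call next_position() once per emitted token, in order.
** Initialise iPos to -1 before the first token. */
typedef struct { int iPos; } PosTracker;

static int next_position(PosTracker *p, int tflags){
  /* Advance unless the token is colocated with the previous one. */
  if( (tflags & FTS5_TOKEN_COLOCATED)==0 || p->iPos<0 ) p->iPos++;
  return p->iPos;
}
```

Highlighting code that groups tokens by this position value, and not by emission order, will treat a synonym and its original term as the same span.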

Phase 2: Dynamic Synonym Management Integration

Step 2.1: Implement Synonym Storage Layer

Use an ordinary table for real-time synonym lookups, optionally alongside an fts5vocab virtual table:

CREATE TABLE dynamic_synonyms(
  base_term TEXT PRIMARY KEY,
  synonyms TEXT -- JSON array or comma-separated
);

-- Optional: Use an fts5vocab table to list the terms actually present in the index
-- (the first argument is the name of the FTS5 table, here fts_table)
CREATE VIRTUAL TABLE vocab USING fts5vocab(fts_table, 'row');
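
If the synonyms column holds a comma-separated list, the tokenizer side has to split it before emitting tokens. The sketch below is one way to do that; for_each_synonym, SynVisitor and count_synonym are hypothetical names, and it assumes the stored terms are already normalized the same way unicode61 normalizes query text (for example lowercased).

```c
#include <string.h>

/* Visitor invoked once per synonym; return non-zero to stop early. */
typedef int (*SynVisitor)(void *pArg, const char *zSyn, int nSyn);

/* Split "a, b,c" on commas, trim ASCII spaces, and skip empty items. */
static int for_each_synonym(const char *zList, SynVisitor x, void *pArg){
  const char *z = zList;
  while( *z ){
    const char *zEnd = strchr(z, ',');
    if( zEnd==NULL ) zEnd = z + strlen(z);
    const char *a = z, *b = zEnd;
    while( a<b && *a==' ' ) a++;        /* trim leading spaces */
    while( b>a && b[-1]==' ' ) b--;     /* trim trailing spaces */
    if( b>a ){
      int rc = x(pArg, a, (int)(b-a));
      if( rc ) return rc;
    }
    z = (*zEnd) ? zEnd+1 : zEnd;        /* step past the comma, if any */
  }
  return 0;
}

/* Demo visitor: counts the items it is given. */
static int count_synonym(void *pArg, const char *z, int n){
  (void)z; (void)n;
  ++*(int*)pArg;
  return 0;
}
```

A visitor that calls xToken with FTS5_TOKEN_COLOCATED would slot in where count_synonym is used here. A JSON array would need a real parser and is outside the scope of this sketch.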

Step 2.2: Pattern-Based Synonym Expansion

Combine FTS5 vocabulary tables with LIKE expressions for advanced matching (synonym_triggers is a hypothetical table that maps each base term to a LIKE pattern):

SELECT term FROM vocab 
WHERE term LIKE (SELECT synonym_pattern FROM synonym_triggers WHERE base_term=?);

Step 2.3: Cache Management Strategies

  • Use sqlite3_update_hook() to detect changes to the synonym table and invalidate cached entries
  • Implement LRU caching for frequent synonym lookups
  • Utilize partial indexes for active synonym subsets
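
One way to connect the update hook to a cache is a generation counter, sketched below. SynonymCache, on_row_change and cache_is_stale are hypothetical names; the callback has the shape sqlite3_update_hook() expects, with long long standing in for sqlite3_int64 so the snippet compiles without sqlite3.h.

```c
#include <string.h>

/* Hypothetical cache state shared with the tokenizer. */
typedef struct SynonymCache {
  unsigned generation;  /* bumped whenever dynamic_synonyms changes */
  unsigned loaded;      /* generation the in-memory map was built from */
} SynonymCache;

/* Callback in the shape used by sqlite3_update_hook().
** Real code should declare rowid as sqlite3_int64. */
static void on_row_change(
  void *pArg, int op, const char *zDb, const char *zTable, long long rowid
){
  SynonymCache *p = (SynonymCache*)pArg;
  (void)op; (void)zDb; (void)rowid;
  if( zTable && strcmp(zTable, "dynamic_synonyms")==0 ) p->generation++;
}

/* Tokenizer side: true when the map must be reloaded before the next lookup. */
static int cache_is_stale(const SynonymCache *p){
  return p->loaded != p->generation;
}
```

Registration would look like sqlite3_update_hook(db, on_row_change, &cache). Note that the hook only reports changes made through the connection it is registered on, so a multi-connection deployment needs an additional signal; polling PRAGMA data_version is one option.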

Phase 3: Phrase Query Handling and Optimization

Step 3.1: Validate Phrase Matching Behavior

Test quoted phrase queries with synonym expansion:

-- Should match both "bsd history" and "freebsd history"
SELECT * FROM fts_table WHERE fts_table MATCH '"bsd history"';
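
The expected behavior can be modelled outside SQLite to make test expectations explicit. The sketch below is a simplified model of phrase matching with per-position alternatives, not FTS5's actual implementation; QueryPos, pos_matches and phrase_matches are hypothetical names.

```c
#include <string.h>

#define MAX_ALT 4

/* One query position: the original term plus its colocated synonyms,
** terminated by NULL if fewer than MAX_ALT entries are used. */
typedef struct { const char *azAlt[MAX_ALT]; } QueryPos;

static int pos_matches(const QueryPos *q, const char *zDocTok){
  for(int i=0; i<MAX_ALT && q->azAlt[i]; i++){
    if( strcmp(q->azAlt[i], zDocTok)==0 ) return 1;
  }
  return 0;
}

/* Return 1 if every query position matches at consecutive document positions. */
static int phrase_matches(
  const QueryPos *aQ, int nQ, const char **azDoc, int nDoc
){
  for(int iStart=0; iStart+nQ<=nDoc; iStart++){
    int i = 0;
    while( i<nQ && pos_matches(&aQ[i], azDoc[iStart+i]) ) i++;
    if( i==nQ ) return 1;
  }
  return 0;
}
```

With the first position holding bsd and freebsd and the second holding history, the model accepts documents containing either "bsd history" or "freebsd history" and rejects documents where another token separates the two words.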

Step 3.2: Analyze Query Execution Plans

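Use EXPLAIN QUERY PLAN to confirm that the query is answered by the FTS5 virtual table. Synonym expansion happens inside the tokenizer and is not shown in the plan, so verify it from the rows returned:

EXPLAIN QUERY PLAN
SELECT * FROM fts_table WHERE fts_table MATCH '"bsd history"';
-- Expect a virtual-table scan of fts_table; synonyms such as freebsd do not appear in the plan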

Step 3.3: Optimize Multi-Term Phrase Performance

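  • If queries combine phrases with prefix terms, declare prefix indexes through the FTS5 prefix option
  • FTS5 tables cannot carry ordinary (covering) indexes, so keep frequent phrase lookups fast by limiting how many synonyms each term expands to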
  • Implement batch processing for bulk synonym updates

Phase 4: Maintenance and Compatibility Assurance

Step 4.1: Versioning and Upgrade Paths

  • Check SQLITE_VERSION_NUMBER in tokenizer code
  • Provide fallback behaviors for deprecated features
  • Maintain unit-test compatibility matrices across SQLite versions
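
A version guard can be as small as the sketch below. SYN_VERSION, MIN_FTS5_VERSION and fts5_tokenizer_supported are hypothetical names; the encoding matches the one SQLite uses for SQLITE_VERSION_NUMBER, and 3.9.0 is the release in which FTS5 first shipped.

```c
/* SQLite encodes version X.Y.Z as X*1000000 + Y*1000 + Z. */
#define SYN_VERSION(x, y, z) ((x)*1000000 + (y)*1000 + (z))

/* FTS5 first shipped in SQLite 3.9.0. */
#define MIN_FTS5_VERSION SYN_VERSION(3, 9, 0)

/* Hypothetical runtime guard; pass the value of sqlite3_libversion_number(). */
static int fts5_tokenizer_supported(int libVersionNumber){
  return libVersionNumber >= MIN_FTS5_VERSION;
}
```

The version number alone does not prove that FTS5 was compiled in, so a production check would probably also consult sqlite3_compileoption_used("ENABLE_FTS5") or simply attempt to look up the fts5 API and handle failure.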

Step 4.2: Performance Monitoring

Instrument the tokenizer with timing hooks:

#ifdef SYNONYM_DEBUG
  clock_t start = clock();
  inject_synonyms(...);
  log_time(clock() - start);
#endif

Step 4.3: Regression Testing Suite

Create test cases covering:

  • Single-term synonym expansion
  • Multi-word phrase preservation
  • Mixed case and Unicode handling
  • Concurrency during synonym updates
  • Long-term index stability

Advanced Considerations

Handling Substring Synonyms
For partial term matches (e.g., "bsd" → "freebsd"), implement:

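// Skip the synonym itself so "freebsd" is not expanded to "freebsd"
if(strstr(base_term, "bsd") && strcmp(base_term, "freebsd") != 0) {
  inject_colocated("freebsd");  // hypothetical helper that emits a colocated token
}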

Multi-Token Synonym Injection
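Colocated tokens share a single position, so they cannot represent a multi-word synonym as a phrase. Emitting the words of a multi-word synonym this way makes each word a separate alternative at that position; a true phrase expansion has to be done by rewriting the query (for example os OR "operating system"):

// When expanding "os" → "operating system"
// Argument order is xToken(pCtx, tflags, pToken, nToken, iStart, iEnd)
xToken(..., 0, "os", ...);
xToken(..., FTS5_TOKEN_COLOCATED, "operating", ...);  // alternative at the same position
xToken(..., FTS5_TOKEN_COLOCATED, "system", ...);     // also an alternative, not a following word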

Weighted Synonym Prioritization
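FTS5 defines only FTS5_TOKEN_COLOCATED for the tflags argument, so packing a weight into the remaining bits is not part of the documented API and should not be relied on. Emit synonyms with the plain flag and apply any down-weighting in a custom auxiliary ranking function instead:

xToken(..., FTS5_TOKEN_COLOCATED, "freebsd", ...);  // no weight bits; rank synonym hits in an auxiliary function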

Taken together, these steps support dynamic synonym handling without index rebuilds while keeping query performance and index stability in check. Regular testing against new SQLite releases and continuous monitoring of the tokenization pipeline remain critical for production use.
