Detecting End of Token Stream in SQLite FTS5 Custom Tokenizers

Issue Overview: Inability to Identify Final Token in FTS5 Tokenizer Chain

The core challenge is accurately determining when a custom tokenizer in an FTS5 tokenizer chain has reached the final token of the original input text. The problem manifests in two specific scenarios:

  1. Trailing Characters Stripped by Upstream Tokenizers: When the original input ends with characters removed by an upstream tokenizer (e.g., unicode61 removing spaces/punctuation), downstream tokenizers lose visibility into the original input’s terminal position. A tokenizer comparing the end offset of its current token (iEnd) against the original text length (nText) will fail to detect stream termination if these characters were stripped, as illustrated by the sketch after this list.

  2. Stopword Elimination at Stream End: If the last lexical unit in the input is a stopword removed by a stopword filter tokenizer, downstream tokenizers receive no indication that the stream has ended after processing the preceding token. This creates ambiguity about whether additional tokens exist beyond the last processed token.
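
A quick illustration of scenario 1 (a self-contained sketch with hypothetical values, following the offset model described above):

#include <stdio.h>

// Hypothetical input "fast cars.  ": stripping removes the trailing ".  ",
// so the final token "cars" ends at offset 9 while the original length is
// 12, and the naive iEnd == nText test never fires.
int main(void){
  int nText = 12;  // Length of the original input text
  int iEnd  = 9;   // End offset of the final emitted token
  printf("naive EOS check fires: %s\n", iEnd == nText ? "yes" : "no");
  return 0;
}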

The fundamental limitation stems from SQLite FTS5’s tokenizer API design, where tokenizers in a chain operate sequentially without shared context about the original input’s complete lifecycle. Each tokenizer processes output from its predecessor as a self-contained text block, with no mechanism to signal global stream termination. This forces downstream tokenizers to implement fragile heuristics to infer stream completion.

Possible Causes: Tokenizer Chain Isolation and Input Modification

Three architectural factors contribute to this issue:

1. Tokenizer Chain Segmentation

FTS5 tokenizers execute in a sequential pipeline where each stage receives modified text from its predecessor. The unicode61 tokenizer (commonly used for basic normalization) alters the input by:

  • Removing specified characters (whitespace, punctuation)
  • Case-folding characters
  • Applying Unicode category-based filtering

These transformations create a modified text buffer passed to subsequent tokenizers. A downstream tokenizer receives this altered text as its input, with no direct access to the original input’s metadata. The nText parameter in xTokenize refers to the current tokenizer’s input length, not the original document text length.
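
For context, a chain of this kind is declared through the tokenize option. A minimal sketch (assuming an open sqlite3 *db) using the documented porter-wrapping-unicode61 syntax:

#include <sqlite3.h>

// The porter wrapper tokenizes the output of unicode61, so unicode61's
// stripping happens before the downstream (porter) stage runs.
int create_docs_table(sqlite3 *db){
  return sqlite3_exec(db,
    "CREATE VIRTUAL TABLE docs USING fts5("
    "  body,"
    "  tokenize = 'porter unicode61'"
    ");",
    0, 0, 0);
}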

2. Absence of Stream Termination Signals

The FTS5 tokenizer API lacks an explicit end-of-stream notification mechanism. Tokenizers emit tokens via repeated calls to xToken, but there’s no final callback to indicate all tokens have been emitted. This forces tokenizer implementations to infer stream completion through indirect means like position comparisons, which fail when upstream tokenizers alter text length.
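
For reference, the tokenizer interface as documented by SQLite is shown below; xToken is the only reporting channel, and xTokenize simply returns when done:

typedef struct Fts5Tokenizer Fts5Tokenizer;
typedef struct fts5_tokenizer fts5_tokenizer;
struct fts5_tokenizer {
  int (*xCreate)(void*, const char **azArg, int nArg, Fts5Tokenizer **ppOut);
  void (*xDelete)(Fts5Tokenizer*);
  int (*xTokenize)(Fts5Tokenizer*,
      void *pCtx,
      int flags,            // Mask of FTS5_TOKENIZE_* flags
      const char *pText, int nText,
      int (*xToken)(
        void *pCtx,         // Copy of 2nd argument to xTokenize()
        int tflags,         // Mask of FTS5_TOKEN_* flags
        const char *pToken, // Pointer to buffer containing token
        int nToken,         // Size of token in bytes
        int iStart,         // Byte offset of token within input text
        int iEnd            // Byte offset of end of token within input text
      )
  );
};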

3. Positional Offset Discrepancies

The iStart and iEnd parameters in xToken represent byte offsets relative to the current tokenizer’s input buffer, not the original document. When upstream tokenizers modify the text (e.g., removing characters), these offsets become misaligned with the original input’s coordinate space. A tokenizer attempting to use iEnd == nText as a termination condition will fail when:

  • Upstream tokenizers have stripped trailing characters
  • Multiple tokenizers modify the text in sequence
  • The original input’s terminal token is eliminated (e.g., stopwords)

Troubleshooting Steps, Solutions & Fixes: Workarounds and Architectural Adjustments

A. Context-Aware Token Stream Termination Detection

Step 1: Implement Input Length Tracking
Create a custom tokenizer that tracks the original input length before any upstream processing, by intercepting the initial xTokenize call at the start of the tokenizer chain. Note that the examples below use a simplified xTokenize signature (the real FTS5 API also passes an int flags argument carrying FTS5_TOKENIZE_* values) and the shorthand typedef Fts5TokenCallback for the xToken callback type. A setup note follows the code:

typedef struct OriginalLengthContext {
  Fts5Tokenizer *pNext;  // Next tokenizer in chain
  int nOrigText;         // Original input length (-1 until first call)
  int bEos;              // End-of-stream flag
} OriginalLengthContext;

static int xTokenizeOriginalTracker(
  Fts5Tokenizer *pTokenizer,
  void *pCtx,
  const char *pText, int nText,
  Fts5TokenCallback xCallback
){
  OriginalLengthContext *ctx = (OriginalLengthContext *)pCtx;
  if(ctx->nOrigText == -1){
    // First call in chain: capture original length
    ctx->nOrigText = nText;
  }
  // Invoke the next tokenizer in the chain
  return ctx->pNext->xTokenize(ctx->pNext, pCtx, pText, nText, xCallback);
}
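
Setup note: the context must be initialized with nOrigText = -1 before each document is tokenized, so that the first call in the chain captures the pre-modification length. A hypothetical sketch (pInner is the next tokenizer instance):

// Hypothetical per-document setup for the tracker context
OriginalLengthContext ctx = { pInner, -1, 0 };  // pNext, nOrigText, bEos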

Step 2: Calculate Effective Terminal Offset
Modify downstream tokenizers to account for characters stripped by upstream tokenizers. For unicode61-style stripping:

int effective_terminal_offset(const char *pOrigText, int nOrigText){
  int i = nOrigText - 1;
  // Reverse-scan until non-stripped character found
  while(i >= 0 && (pOrigText[i] == ' ' || pOrigText[i] == '.')){
    i--;
  }
  return i + 1;  // Position after last non-stripped char
}

// In xTokenize:
if(iEnd >= effective_terminal_offset(pOrigText, nOrigText)){
  // Last token in original stream
}

Step 3: Propagate Termination Flags Through Callbacks
Augment token emission to include termination status. The callback context carries the original text so the terminal offset can be computed:

typedef struct EnhancedToken {
  const char *pOrigText;  // Original document text, for the terminal check
  int nOrigText;          // Original document length
  const char *pToken;
  int nToken;
  int iStart;             // Token offsets (also needed by Solution 2 below)
  int iEnd;
  int bFinal;
} EnhancedToken;

static int xTokenCallback(void *pCtx, int tflags,
  const char *pToken, int nToken,
  int iStart, int iEnd
){
  EnhancedToken *et = (EnhancedToken *)pCtx;
  et->pToken = pToken;
  et->nToken = nToken;
  et->iStart = iStart;
  et->iEnd = iEnd;
  et->bFinal = (iEnd >= effective_terminal_offset(et->pOrigText, et->nOrigText));
  // Process the token together with its finality flag
  return SQLITE_OK;
}

B. Custom Tokenizer Chain With Termination Awareness

Solution 1: Wrapper Tokenizer for Stream Metadata
Create a wrapper tokenizer that preserves original input metadata:

typedef struct ChainContext {
  Fts5Tokenizer *pNext;  // Next tokenizer in chain
  int nOrigText;         // Original input length
  const char *pOrigText; // Original input pointer
} ChainContext;

static int xTokenizeChainWrapper(
  Fts5Tokenizer *pTokenizer,
  void *pCtx,
  const char *pText, int nText,
  Fts5TokenCallback xCallback
){
  ChainContext *ctx = (ChainContext *)pCtx;
  if(ctx->nOrigText == -1){
    // First call in chain: capture original input metadata
    ctx->nOrigText = nText;
    ctx->pOrigText = pText;
  }
  // Invoke next tokenizer with context preservation
  return ctx->pNext->xTokenize(ctx->pNext, ctx, pText, nText, xCallback);
}
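
To install the wrapper, register its methods through the fts5_api object, which is obtained with the documented "fts5_api_ptr" bind-pointer idiom. A sketch (the tokenizer name "chainwrapper" and the pMethods struct are assumptions):

#include <sqlite3.h>  // In the amalgamation this also declares fts5_api

static int get_fts5_api(sqlite3 *db, fts5_api **ppApi){
  sqlite3_stmt *pStmt = 0;
  int rc = sqlite3_prepare_v2(db, "SELECT fts5(?1)", -1, &pStmt, 0);
  if(rc != SQLITE_OK) return rc;
  *ppApi = 0;
  sqlite3_bind_pointer(pStmt, 1, (void*)ppApi, "fts5_api_ptr", 0);
  sqlite3_step(pStmt);
  return sqlite3_finalize(pStmt);
}

static int register_wrapper(sqlite3 *db, fts5_tokenizer *pMethods, void *pData){
  fts5_api *pApi = 0;
  int rc = get_fts5_api(db, &pApi);
  if(rc != SQLITE_OK || pApi == 0) return rc ? rc : SQLITE_ERROR;
  // Register under the (assumed) name "chainwrapper"
  return pApi->xCreateTokenizer(pApi, "chainwrapper", pData, pMethods, 0);
}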

Solution 2: Hybrid Tokenizer With Lookahead
Implement a tokenizer that buffers tokens to detect termination:

typedef struct BufferedTokenizer {
  Fts5Tokenizer *pNext;
  EnhancedToken *tokens;  // Pre-allocated token buffer
  int nTokens;
  int capacity;
} BufferedTokenizer;

// Internal callback: buffers each token instead of emitting it.
// NOTE: a robust version must copy pToken into owned memory, since
// tokenizers may reuse the token buffer between calls.
static int bufferedCallback(void *pCtx, int tflags,
  const char *pToken, int nToken, int iStart, int iEnd
){
  BufferedTokenizer *bt = (BufferedTokenizer *)pCtx;
  if(bt->nTokens >= bt->capacity) return SQLITE_NOMEM;  // or grow the buffer
  EnhancedToken *t = &bt->tokens[bt->nTokens++];
  t->pToken = pToken;
  t->nToken = nToken;
  t->iStart = iStart;
  t->iEnd = iEnd;
  t->bFinal = 0;
  return SQLITE_OK;
}

static int xTokenizeBuffered(
  Fts5Tokenizer *pTokenizer,
  void *pCtx,
  const char *pText, int nText,
  Fts5TokenCallback xCallback
){
  BufferedTokenizer *bt = (BufferedTokenizer *)pTokenizer;
  bt->nTokens = 0;
  // Collect all tokens from the next tokenizer into bt->tokens
  int rc = bt->pNext->xTokenize(bt->pNext, bt, pText, nText, bufferedCallback);
  // Post-process to mark the last buffered token as final
  if(rc == SQLITE_OK && bt->nTokens > 0){
    bt->tokens[bt->nTokens-1].bFinal = 1;
  }
  // Emit buffered tokens to the real callback
  for(int i=0; rc==SQLITE_OK && i<bt->nTokens; i++){
    rc = xCallback(pCtx, 0, bt->tokens[i].pToken, bt->tokens[i].nToken,
                   bt->tokens[i].iStart, bt->tokens[i].iEnd);
  }
  return rc;
}
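
Design note: this approach buffers every token of a document before emitting any, so memory grows with document size, and the pToken pointers must be copied if the wrapped tokenizer reuses its output buffer (see the comment above). In exchange, finality is known with certainty regardless of what upstream stages stripped.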

C. SQLite FTS5 Extension Modifications

For developers able to modify SQLite’s source code:

Modification 1: Add FTS5_TOKEN_FINAL Flag
Extend the xToken callback signature to include a finality flag:

  1. In fts5Int.h, modify the Fts5TokenCallback typedef:
typedef int (*Fts5TokenCallback)(
  void *pCtx, 
  int tflags,            // Existing flags
  const char *pToken,    // Token text
  int nToken,            // Token length
  int iStart,            // Start offset
  int iEnd,              // End offset
  int bFinal             // New final token flag
);
  2. In tokenizer implementations, set bFinal=1 when emitting the last token.
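
Note that this is a source-level change to the tokenizer interface: every registered tokenizer and every xToken callback must be updated to the new signature and SQLite rebuilt, so the modification trades compatibility with stock SQLite builds for an explicit finality signal.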

Modification 2: Terminal Offset Tracking
Add original text metadata to tokenizer context:

typedef struct Fts5TokenizerContext {
  const char *pOrigText;
  int nOrigText;
  // ... existing fields ...
} Fts5TokenizerContext;

void fts5TokenizerInitContext(
  Fts5TokenizerContext *pCtx,
  const char *pText, int nText
){
  pCtx->pOrigText = pText;
  pCtx->nOrigText = nText;
}

Modification 3: Enhanced Tokenizer API
Introduce new API methods for stream termination detection:

// Returns 1 if current token is last in original stream
int sqlite3_fts5_tokenizer_is_final(
  Fts5TokenizerContext *pCtx,
  int iEnd
){
  int effEnd = effective_terminal_offset(pCtx->pOrigText, pCtx->nOrigText);
  return iEnd >= effEnd;
}

D. Alternative Architectural Approaches

Approach 1: Preprocessing Pipeline
Shift text normalization outside the tokenizer chain:

  1. Create a preprocessing step that applies unicode61 rules
  2. Store both original and normalized text in shadow columns
  3. Use normalized text for tokenization
  4. Reference original text for termination detection
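
A minimal sketch of this layout, using FTS5's documented external-content syntax (the schema, and the idea of a user-supplied normalize_text() routine applied before each insert, are assumptions):

#include <sqlite3.h>

// Keep the raw text alongside a normalized copy; index only the normalized
// column through an external-content FTS5 table.
int create_preprocessed_schema(sqlite3 *db){
  return sqlite3_exec(db,
    "CREATE TABLE articles(id INTEGER PRIMARY KEY, raw TEXT, norm TEXT);"
    "CREATE VIRTUAL TABLE articles_fts USING fts5("
    "  norm, content='articles', content_rowid='id'"
    ");",
    0, 0, 0);
}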

Approach 2: Dual-Phase Tokenization
Perform tokenization in two phases:

  1. Phase 1: Execute standard tokenizer chain
  2. Phase 2: Re-tokenize with original text, comparing offsets
  3. Merge phase results with termination flags

Approach 3: Proxy Tokenizer with Termination Hooks
Implement a proxy tokenizer that wraps existing tokenizers:

typedef struct ProxyTokenizer {
  Fts5Tokenizer *pWrapped;
  void (*xEos)(void*);  // End-of-stream callback
} ProxyTokenizer;

static int xTokenizeProxy(
  Fts5Tokenizer *pTokenizer,
  void *pCtx,
  const char *pText, int nText,
  Fts5TokenCallback xCallback
){
  ProxyTokenizer *pt = (ProxyTokenizer *)pTokenizer;
  int rc = pt->pWrapped->xTokenize(pt->pWrapped, pCtx, pText, nText, xCallback);
  if(rc == SQLITE_OK && pt->xEos){
    // Invoke the EOS callback once the wrapped tokenizer has emitted its last token
    pt->xEos(pCtx);
  }
  return rc;
}

E. Mitigation Strategies for Common Scenarios

Scenario 1: Trailing Punctuation Stripping

  • Solution: Compute effective terminal offset by reverse-scanning original text
  • Implementation:
    // is_stripped_char() is a user-supplied predicate matching the separator
    // set configured on the upstream tokenizer (e.g. unicode61)
    int compute_effective_end(const char *p, int n){
      while(n > 0 && is_stripped_char(p[n-1])){
        n--;
      }
      return n;  // Offset just past the last retained character
    }

Scenario 2: Last Token Stopword Removal

  • Solution: Filter stopwords in the token callback while tracking the end offset of the last emitted token
  • Implementation:
    typedef struct StopwordContext {
      void *pOuterCtx;           // Context for the downstream callback
      Fts5TokenCallback xOuter;  // Downstream token callback
      int lastValidEnd;          // iEnd of the last non-stopword token
    } StopwordContext;

    // Per-token callback: swallows stopwords, forwards everything else.
    // is_stopword() is a user-supplied lookup.
    static int stopwordTokenCallback(void *pCtx, int tflags,
      const char *pToken, int nToken,
      int iStart, int iEnd
    ){
      StopwordContext *swc = (StopwordContext *)pCtx;
      if(!is_stopword(pToken, nToken)){
        swc->lastValidEnd = iEnd;
        return swc->xOuter(swc->pOuterCtx, tflags, pToken, nToken, iStart, iEnd);
      }
      return SQLITE_OK;  // Swallow the stopword
    }

    // After tokenization completes:
    if(swc->lastValidEnd >= effective_terminal_offset(pOrigText, nOrigText)){
      // The last emitted token was also the last token of the original stream
    }

F. Best Practices for Robust Token Stream Handling

  1. Original Text Preservation

    • Store original input text separately from normalized versions
    • Use generated columns or shadow tables to maintain original text
  2. Tokenizer Chain Design

    • Place length-modifying tokenizers early in the chain
    • Use wrapper tokenizers to capture pre-modification text state
  3. Offset Translation Layers

    • Maintain mapping between normalized and original text offsets
    • Implement bi-directional offset conversion functions (a sketch follows this list)
  4. Testing Methodologies

    • Create test cases with trailing stripped characters
    • Verify tokenizer behavior with empty inputs
    • Validate stopword removal at stream end
  5. Performance Considerations

    • Cache effective terminal offset calculations
    • Avoid redundant text scanning in hot paths
    • Use memoization for stripped character detection
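
For item 3 above, a minimal sketch of an offset-translation table (a hypothetical structure, filled in while the normalizer runs):

// Hypothetical offset map: for each byte of the normalized text, record the
// offset of the corresponding byte in the original text.
typedef struct OffsetMap {
  int *aOrig;  // aOrig[i] = original-text offset of normalized byte i
  int n;       // Entry count (normalized text length)
} OffsetMap;

// Translate a normalized-buffer offset back to original coordinates.
static int map_to_original(const OffsetMap *pMap, int iNorm){
  if(pMap->n == 0) return 0;
  if(iNorm >= pMap->n) return pMap->aOrig[pMap->n-1] + 1;  // one past last byte
  return pMap->aOrig[iNorm];
}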

These solutions provide comprehensive strategies for addressing token stream termination detection in SQLite FTS5 tokenizer chains, balancing API constraints with practical implementation requirements. Developers should select the approach that best aligns with their specific use case complexity and performance requirements.
