Detecting End of Token Stream in SQLite FTS5 Custom Tokenizers
Issue Overview: Inability to Identify Final Token in FTS5 Tokenizer Chain
The core challenge is accurately determining when a custom tokenizer in an FTS5 tokenizer chain has reached the final token of the original input text. The problem manifests in two specific scenarios:
Trailing Characters Stripped by Upstream Tokenizers: When the original input ends with characters removed by an upstream tokenizer (e.g., unicode61 removing spaces/punctuation), downstream tokenizers lose visibility into the original input’s terminal position. A tokenizer comparing the end offset of its current token (iEnd) against the original text length (nText) will fail to detect stream termination if these characters were stripped.
Stopword Elimination at Stream End: If the last lexical unit in the input is a stopword removed by a stopword-filter tokenizer, downstream tokenizers receive no indication that the stream has ended after the preceding token is processed. This creates ambiguity about whether additional tokens exist beyond the last processed token.
The fundamental limitation stems from SQLite FTS5’s tokenizer API design, where tokenizers in a chain operate sequentially without any shared context about the original input. Each tokenizer processes the output of its predecessor as a self-contained text block, with no mechanism to signal global stream termination. This forces downstream tokenizers to implement fragile heuristics to infer stream completion.
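For context, a tokenizer chain is declared through the tokenize= option of the FTS5 table. The sketch below assumes a hypothetical custom wrapper registered under the name stopfilter; like the built-in porter wrapper, it would treat its first argument as the name of the tokenizer it wraps:

// db is an open sqlite3* handle; "stopfilter" is a hypothetical custom
// wrapper tokenizer that delegates to the built-in unicode61.
int rc = sqlite3_exec(db,
    "CREATE VIRTUAL TABLE docs USING fts5("
    "  body,"
    "  tokenize = 'stopfilter unicode61'"
    ");", 0, 0, 0);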
Possible Causes: Tokenizer Chain Isolation and Input Modification
Three architectural factors contribute to this issue:
1. Tokenizer Chain Segmentation
FTS5 tokenizers execute in a sequential pipeline where each stage receives modified text from its predecessor. The unicode61 tokenizer (commonly used for basic normalization) alters the input by:
- Removing specified characters (whitespace, punctuation)
- Case-folding characters
- Applying Unicode category-based filtering
These transformations create a modified text buffer passed to subsequent tokenizers. A downstream tokenizer receives this altered text as its input, with no direct access to the original input’s metadata. The nText parameter in xTokenize refers to the current tokenizer’s input length, not the original document text length.
2. Absence of Stream Termination Signals
The FTS5 tokenizer API lacks an explicit end-of-stream notification mechanism. Tokenizers emit tokens via repeated calls to xToken, but there is no final callback to indicate that all tokens have been emitted. This forces tokenizer implementations to infer stream completion through indirect means like position comparisons, which fail when upstream tokenizers alter text length.
3. Positional Offset Discrepancies
The iStart and iEnd parameters in xToken represent byte offsets relative to the current tokenizer’s input buffer, not the original document. When upstream tokenizers modify the text (e.g., removing characters), these offsets become misaligned with the original input’s coordinate space. A tokenizer attempting to use iEnd == nText as a termination condition (illustrated in the sketch after this list) will fail when:
- Upstream tokenizers have stripped trailing characters
- Multiple tokenizers modify the text in sequence
- The original input’s terminal token is eliminated (e.g., stopwords)
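A minimal illustration of the failing heuristic, assuming the 13-byte input "hello world. " is run through unicode61, which strips the trailing period and space:

// unicode61 emits "hello" (iStart 0, iEnd 5) and "world" (6, 11) for
// the 13-byte input "hello world. "; iEnd therefore never reaches nText.
static int naiveCallback(void *pCtx, int tflags,
  const char *pToken, int nToken, int iStart, int iEnd
){
  int nText = 13;  // Length of the original input, hard-coded for illustration
  if(iEnd == nText){
    // Never reached: the final token "world" ends at offset 11
  }
  return SQLITE_OK;
}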
Troubleshooting Steps, Solutions & Fixes: Workarounds and Architectural Adjustments
A. Context-Aware Token Stream Termination Detection
Step 1: Implement Input Length Tracking
Create a custom tokenizer that tracks the original input length before any upstream processing. This requires intercepting the initial xTokenize call at the start of the tokenizer chain:
// Convenience typedef for the xToken callback passed to xTokenize().
// This name is not part of the SQLite API; it is used throughout this
// article for brevity.
typedef int (*Fts5TokenCallback)(
  void *pCtx, int tflags,
  const char *pToken, int nToken,
  int iStart, int iEnd
);

typedef struct OriginalLengthContext {
  fts5_tokenizer nextApi;  // API of the wrapped (next) tokenizer
  Fts5Tokenizer *pNext;    // Instance of the next tokenizer
  int nOrigText;           // Original input length (-1 until first call)
  int bEos;                // End-of-stream flag
} OriginalLengthContext;

static int xTokenizeOriginalTracker(
  Fts5Tokenizer *pTokenizer,
  void *pCtx,
  int flags,               // Mask of FTS5_TOKENIZE_* flags
  const char *pText, int nText,
  Fts5TokenCallback xCallback
){
  // Tracker state lives in the tokenizer instance; pCtx belongs to the
  // token callback and must be forwarded unchanged.
  OriginalLengthContext *ctx = (OriginalLengthContext *)pTokenizer;
  if(ctx->nOrigText == -1){
    // First call in chain: capture original length
    ctx->nOrigText = nText;
  }
  // Invoke next tokenizer in chain
  return ctx->nextApi.xTokenize(ctx->pNext, pCtx, flags, pText, nText, xCallback);
}
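For completeness, here is a hedged sketch of the corresponding xCreate, assuming fts5.h and <string.h> are included, the fts5_api pointer is supplied as the registration user-data, and (like the built-in porter wrapper) azArg[0] names the tokenizer to wrap:

static int xCreateOriginalTracker(
  void *pUserData,              // fts5_api* supplied at registration
  const char **azArg, int nArg, // azArg[0] names the wrapped tokenizer
  Fts5Tokenizer **ppOut
){
  fts5_api *pApi = (fts5_api *)pUserData;
  void *pNextCtx = 0;
  int rc;
  OriginalLengthContext *p = sqlite3_malloc(sizeof(OriginalLengthContext));
  if(p == 0) return SQLITE_NOMEM;
  memset(p, 0, sizeof(*p));
  p->nOrigText = -1;
  // Look up the tokenizer to wrap ("unicode61" if none is named)
  rc = pApi->xFindTokenizer(pApi,
      nArg ? azArg[0] : "unicode61", &pNextCtx, &p->nextApi);
  if(rc == SQLITE_OK){
    rc = p->nextApi.xCreate(pNextCtx,
        nArg ? &azArg[1] : 0, nArg ? nArg-1 : 0, &p->pNext);
  }
  if(rc != SQLITE_OK){
    sqlite3_free(p);
    p = 0;
  }
  *ppOut = (Fts5Tokenizer *)p;
  return rc;
}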
Step 2: Calculate Effective Terminal Offset
Modify downstream tokenizers to account for characters stripped by upstream tokenizers. For unicode61-style stripping:
int effective_terminal_offset(const char *pOrigText, int nOrigText){
  int i = nOrigText - 1;
  // Reverse-scan until a non-stripped character is found. The stripped
  // set used here (space, period) is illustrative and must match the
  // separator set of the upstream tokenizer.
  while(i >= 0 && (pOrigText[i] == ' ' || pOrigText[i] == '.')){
    i--;
  }
  return i + 1; // Position after last non-stripped char
}

// In xTokenize, after each emitted token:
if(iEnd >= effective_terminal_offset(pOrigText, nOrigText)){
  // Last token in original stream
}
Step 3: Propagate Termination Flags Through Callbacks
Augment token emission to include termination status:
typedef struct EnhancedToken {
  const char *pToken;
  int nToken;
  int iStart;   // Retained for buffering (see Solution 2 below)
  int iEnd;
  int bFinal;
} EnhancedToken;

// Callback context carrying the original-text metadata needed by
// effective_terminal_offset().
typedef struct TokenContext {
  const char *pOrigText;  // Original input text
  int nOrigText;          // Original input length
  EnhancedToken token;    // Most recently received token
} TokenContext;

static int xTokenCallback(void *pCtx, int tflags,
  const char *pToken, int nToken,
  int iStart, int iEnd
){
  TokenContext *tc = (TokenContext *)pCtx;
  EnhancedToken *et = &tc->token;
  et->pToken = pToken;
  et->nToken = nToken;
  et->iStart = iStart;
  et->iEnd = iEnd;
  et->bFinal = (iEnd >= effective_terminal_offset(tc->pOrigText, tc->nOrigText));
  // Process token with finality flag
  return SQLITE_OK;
}
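A hedged usage sketch, where pTokApi and pTok are hypothetical placeholders for the chain’s fts5_tokenizer API struct and tokenizer instance:

// Tokenize pDoc/nDoc and inspect the finality of the last token seen.
TokenContext tc = { pDoc, nDoc, {0} };
int rc = pTokApi->xTokenize(pTok, &tc, FTS5_TOKENIZE_DOCUMENT,
                            pDoc, nDoc, xTokenCallback);
if(rc == SQLITE_OK && tc.token.bFinal){
  // The last token received was flagged as final
}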
B. Custom Tokenizer Chain With Termination Awareness
Solution 1: Wrapper Tokenizer for Stream Metadata
Create a wrapper tokenizer that preserves original input metadata:
typedef struct ChainContext {
  fts5_tokenizer nextApi;  // API of next tokenizer in chain
  Fts5Tokenizer *pNext;    // Next tokenizer instance
  int nOrigText;           // Original input length (-1 until first call)
  const char *pOrigText;   // Original input pointer
} ChainContext;

static int xTokenizeChainWrapper(
  Fts5Tokenizer *pTokenizer,
  void *pCtx,
  int flags,
  const char *pText, int nText,
  Fts5TokenCallback xCallback
){
  // Chain metadata lives in the tokenizer instance; pCtx is the token
  // callback’s context and is forwarded unchanged.
  ChainContext *ctx = (ChainContext *)pTokenizer;
  if(ctx->nOrigText == -1){
    ctx->nOrigText = nText;
    ctx->pOrigText = pText;
  }
  // Invoke next tokenizer with context preservation
  return ctx->nextApi.xTokenize(ctx->pNext, pCtx, flags, pText, nText, xCallback);
}
Solution 2: Hybrid Tokenizer With Lookahead
Implement a tokenizer that buffers tokens to detect termination:
typedef struct BufferedTokenizer {
  fts5_tokenizer nextApi;
  Fts5Tokenizer *pNext;
  EnhancedToken *tokens;   // Dynamically grown token buffer
  int nTokens;
  int capacity;
} BufferedTokenizer;

// Intermediate callback: appends each token to the buffer.
static int bufferedCallback(void *pCtx, int tflags,
  const char *pToken, int nToken, int iStart, int iEnd
){
  BufferedTokenizer *bt = (BufferedTokenizer *)pCtx;
  if(bt->nTokens == bt->capacity){
    int newCap = bt->capacity ? bt->capacity*2 : 16;
    EnhancedToken *aNew = sqlite3_realloc(bt->tokens, newCap*sizeof(EnhancedToken));
    if(aNew == 0) return SQLITE_NOMEM;
    bt->tokens = aNew;
    bt->capacity = newCap;
  }
  // NOTE: pToken is only guaranteed valid for the duration of this
  // callback; a production implementation must copy the token text.
  bt->tokens[bt->nTokens] = (EnhancedToken){pToken, nToken, iStart, iEnd, 0};
  bt->nTokens++;
  return SQLITE_OK;
}

static int xTokenizeBuffered(
  Fts5Tokenizer *pTokenizer,
  void *pCtx,
  int flags,
  const char *pText, int nText,
  Fts5TokenCallback xCallback
){
  BufferedTokenizer *bt = (BufferedTokenizer *)pTokenizer;
  bt->nTokens = 0;
  // Collect all tokens from the next tokenizer
  int rc = bt->nextApi.xTokenize(bt->pNext, bt, flags, pText, nText, bufferedCallback);
  // Post-process to mark the last token
  if(rc == SQLITE_OK && bt->nTokens > 0){
    bt->tokens[bt->nTokens-1].bFinal = 1;
  }
  // Emit buffered tokens to the real callback
  for(int i=0; rc==SQLITE_OK && i<bt->nTokens; i++){
    rc = xCallback(pCtx, 0, bt->tokens[i].pToken, bt->tokens[i].nToken,
                   bt->tokens[i].iStart, bt->tokens[i].iEnd);
  }
  return rc;
}
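Buffering trades memory for certainty: the entire token stream for a document is held before anything is emitted, and, as noted in the comments above, the token text pointers supplied to the callback are only valid during the callback itself, so a production implementation must copy each token into the buffer.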
C. SQLite FTS5 Extension Modifications
For developers able to modify SQLite’s source code:
Modification 1: Add FTS5_TOKEN_FINAL Flag
Extend the xToken callback signature to include a finality flag:
- In fts5Int.h, update the xToken callback type (represented in this article by the Fts5TokenCallback convenience typedef):
typedef int (*Fts5TokenCallback)(
void *pCtx,
int tflags, // Existing flags
const char *pToken, // Token text
int nToken, // Token length
int iStart, // Start offset
int iEnd, // End offset
int bFinal // New final token flag
);
- In tokenizer implementations, set bFinal=1 when emitting the last token.
Modification 2: Terminal Offset Tracking
Add original text metadata to tokenizer context:
typedef struct Fts5TokenizerContext {
const char *pOrigText;
int nOrigText;
// ... existing fields ...
} Fts5TokenizerContext;
void fts5TokenizerInitContext(
Fts5TokenizerContext *pCtx,
const char *pText, int nText
){
pCtx->pOrigText = pText;
pCtx->nOrigText = nText;
}
Modification 3: Enhanced Tokenizer API
Introduce new API methods for stream termination detection:
// Returns 1 if current token is last in original stream
int sqlite3_fts5_tokenizer_is_final(
Fts5TokenizerContext *pCtx,
int iEnd
){
int effEnd = effective_terminal_offset(pCtx->pOrigText, pCtx->nOrigText);
return iEnd >= effEnd;
}
D. Alternative Architectural Approaches
Approach 1: Preprocessing Pipeline
Shift text normalization outside the tokenizer chain:
- Create a preprocessing step that applies unicode61 rules
- Store both original and normalized text in shadow columns
- Use normalized text for tokenization
- Reference original text for termination detection (a sketch follows below)
Approach 2: Dual-Phase Tokenization
Perform tokenization in two phases:
- Phase 1: Execute standard tokenizer chain
- Phase 2: Re-tokenize with original text, comparing offsets
- Merge phase results with termination flags
Approach 3: Proxy Tokenizer with Termination Hooks
Implement a proxy tokenizer that wraps existing tokenizers:
typedef struct ProxyTokenizer {
  fts5_tokenizer wrappedApi;  // API of the wrapped tokenizer
  Fts5Tokenizer *pWrapped;    // Wrapped tokenizer instance
  void (*xEos)(void*);        // End-of-stream callback
} ProxyTokenizer;

static int xTokenizeProxy(
  Fts5Tokenizer *pTokenizer,
  void *pCtx,
  int flags,
  const char *pText, int nText,
  Fts5TokenCallback xCallback
){
  ProxyTokenizer *pt = (ProxyTokenizer *)pTokenizer;
  int rc = pt->wrappedApi.xTokenize(pt->pWrapped, pCtx, flags, pText, nText, xCallback);
  if(rc == SQLITE_OK){
    // Invoke EOS callback after the last token has been emitted
    pt->xEos(pCtx);
  }
  return rc;
}
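Note that the proxy only signals that the stream has ended after the fact; it cannot flag the final token as it is emitted. If per-token finality is required, the proxy must be combined with a lookahead buffer such as the hybrid tokenizer in Solution 2.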
E. Mitigation Strategies for Common Scenarios
Scenario 1: Trailing Punctuation Stripping
- Solution: Compute effective terminal offset by reverse-scanning original text
- Implementation:
int compute_effective_end(const char *p, int n){
  while(n > 0 && is_stripped_char(p[n-1])){
    n--;
  }
  return n;
}
Scenario 2: Last Token Stopword Removal
- Solution: Maintain a token buffer with lookahead
- Implementation:
typedef struct StopwordContext {
  void *pOuterCtx;           // Downstream callback context
  Fts5TokenCallback xOuter;  // Downstream token callback
  int lastValidEnd;          // iEnd of the last non-stopword token
} StopwordContext;

// Token callback installed by the stopword filter; is_stopword() is a
// hypothetical helper testing the token against the stopword list.
static int xStopwordFilterCallback(void *pCtx, int tflags,
  const char *pToken, int nToken, int iStart, int iEnd
){
  StopwordContext *swc = (StopwordContext *)pCtx;
  if(!is_stopword(pToken, nToken)){
    swc->lastValidEnd = iEnd;
    return swc->xOuter(swc->pOuterCtx, tflags, pToken, nToken, iStart, iEnd);
  }
  return SQLITE_OK;  // Swallow the stopword
}

// After tokenization completes:
// if(swc->lastValidEnd == effective_terminal_offset(...)){
//   // Emit final token marker
// }
F. Best Practices for Robust Token Stream Handling
Original Text Preservation
- Store original input text separately from normalized versions
- Use generated columns or shadow tables to maintain original text
Tokenizer Chain Design
- Place length-modifying tokenizers early in the chain
- Use wrapper tokenizers to capture pre-modification text state
Offset Translation Layers
- Maintain mapping between normalized and original text offsets
- Implement bi-directional offset conversion functions (see the sketch below)
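A minimal sketch of such a translation layer, assuming the normalizer records the original offset of every byte it emits; OffsetMap and its fields are illustrative names, and the reverse mapping can be maintained as a parallel array built at the same time:

typedef struct OffsetMap {
  int *aOrig;  // aOrig[i] = original-text offset of normalized byte i
  int n;       // Number of normalized bytes
} OffsetMap;

// Map a normalized-text offset back into the original text.
static int offsetToOriginal(const OffsetMap *pMap, int iNorm){
  if(pMap->n == 0) return 0;
  if(iNorm >= pMap->n) return pMap->aOrig[pMap->n-1] + 1;
  return pMap->aOrig[iNorm];
}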
Testing Methodologies
- Create test cases with trailing stripped characters
- Verify tokenizer behavior with empty inputs
- Validate stopword removal at stream end (sample checks follow below)
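A few sample checks against effective_terminal_offset() from Step 2, covering trailing stripped characters, an empty input, and an all-stripped input:

#include <assert.h>

static void test_effective_terminal_offset(void){
  // "hello world. " is 13 bytes; the trailing ". " is stripped, so the
  // effective end is offset 11, just past the final 'd'.
  assert(effective_terminal_offset("hello world. ", 13) == 11);
  assert(effective_terminal_offset("", 0) == 0);    // Empty input
  assert(effective_terminal_offset("...", 3) == 0); // Entirely stripped
}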
Performance Considerations
- Cache effective terminal offset calculations
- Avoid redundant text scanning in hot paths
- Use memoization for stripped character detection (see the table-based sketch below)
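One way to memoize the stripped-character test is a one-time lookup table; the names below are illustrative:

// 256-entry table of separator bytes, built once at tokenizer creation.
static unsigned char aIsStripped[256];

static void init_stripped_table(void){
  aIsStripped[(unsigned char)' '] = 1;
  aIsStripped[(unsigned char)'.'] = 1;
}

// Constant-time test used by the reverse scans above.
#define is_stripped_char(c) (aIsStripped[(unsigned char)(c)])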
These solutions provide comprehensive strategies for addressing token stream termination detection in SQLite FTS5 tokenizer chains, balancing API constraints with practical implementation requirements. Developers should select the approach that best aligns with their specific use case complexity and performance requirements.