Overlapping Highlight Ranges in SQLite FTS5 Trigram Tokenization
Understanding the Behavior of highlight() and snippet() with Trigram Tokenization in FTS5
The highlight()
and snippet()
functions in SQLite’s Full-Text Search (FTS5) extension are designed to return annotated portions of text that match a query. These annotations help users quickly identify relevant segments in search results. However, when using the trigram tokenizer, these functions may produce unexpected results when adjacent or overlapping matches occur. Specifically, instead of merging contiguous or overlapping matched regions into a single highlighted span, the functions may generate multiple fragmented annotations.
For example, consider a virtual table created with tokenize="trigram"
containing the text "abcde". When searching for the terms "abc" AND "cde", the highlight()
function returns [abc]de[cde]
instead of a unified [abcde]
. This occurs because the trigram tokenizer splits the text into overlapping 3-character sequences (e.g., "abc", "bcd", "cde"), and the FTS5 engine does not automatically consolidate adjacent matches by default. The result is fragmented highlighting, which diminishes readability and utility.
This behavior contrasts with FTS5’s default tokenizer, which splits text on whitespace and punctuation. In the default mode, overlapping matches are more likely to be merged into a single span because tokens are discrete and non-overlapping. The trigram tokenizer’s design inherently creates overlapping tokens, leading to edge cases where the highlighting logic struggles to determine contiguous ranges.
Root Causes of Fragmented Highlighting in Trigram Tokenizer Mode
1. Tokenizer-Generated Overlapping Spans
The trigram tokenizer divides text into sliding 3-character windows. For the string "abcde", this produces tokens "abc", "bcd", and "cde". When a query matches multiple adjacent tokens (e.g., "abc" and "cde"), the FTS5 engine registers these as separate matches. The highlight()
function annotates each token individually unless explicitly instructed to merge adjacent spans.
In the example above, "abc" and "cde" are non-overlapping tokens but are adjacent in the original text. However, because the tokenizer does not provide positional metadata indicating that "abc" ends at offset 3 and "cde" starts at offset 2 (due to the sliding window), FTS5 cannot infer that these tokens form a contiguous block in the source text.
2. Lack of Span Merging Logic for Overlapping Tokens
The FTS5 highlight()
function uses a greedy algorithm to merge overlapping or adjacent matches. This works well for non-overlapping tokens (e.g., "a b c d e") but fails when tokens share characters. The trigram tokenizer’s output creates ambiguity: the same character may belong to multiple tokens, making it non-trivial to determine where one span ends and another begins.
For instance, in "abcde", the token "bcd" overlaps with both "abc" and "cde". If the query matches "abc" and "cde", the highlighting logic must decide whether to include "bcd" in the annotated range. The absence of context about token adjacency in the source text leads to保守的 (conservative) annotation, resulting in multiple brackets.
3. Positional Metadata Limitations in Trigram Tokenization
FTS5 relies on token positions to calculate highlight ranges. The trigram tokenizer assigns positions based on token occurrence, not their original text offsets. For example, in "abc x cde", the tokens "abc", "bcx", "c x", "x c", "cde" are assigned positions 1, 2, 3, 4, 5. When searching for "abc" (position 1) and "cde" (position 5), the gap between positions 1 and 5 suggests non-contiguous tokens, even though "abc" and "cde" are adjacent in the source text.
This positional discontinuity prevents the highlight()
function from recognizing that the matched tokens form a contiguous block. The result is fragmented highlighting, as the function annotates each token separately.
Resolving Overlapping Highlights in Trigram Tokenizer Mode
Step 1: Verify SQLite Version and Apply the Fix
The initial step is to confirm whether the SQLite version includes the e952db86faaafd2e commit, which addresses this issue. This fix modifies how FTS5 handles positional data for trigram tokens, enabling the highlight()
function to merge adjacent matches.
Action:
- Check the SQLite version:
SELECT sqlite_version();
Versions prior to 3.45.0 (released after October 2023) may not include the fix.
- If using an older version, update SQLite or apply the patch from the source repository.
Step 2: Custom Tokenizer Configuration for Span Merging
If updating SQLite is not feasible, implement a workaround by modifying tokenizer behavior or post-processing highlight results.
Workaround 1: Custom Trigram Tokenizer with Positional Metadata
Create a custom tokenizer that emits positional metadata reflecting the original text offsets. This requires writing a C extension, as SQLite’s built-in trigram tokenizer does not expose this data.
Example pseudocode for a custom tokenizer:
static int xTokenize(
void *pCtx,
const char *pText,
int nText,
Fts5Token *pToken
){
for(int i=0; i<=nText-3; i++){
pToken->iStart = i; // Start offset in source text
pToken->n = 3; // Token length
// Emit token
}
return SQLITE_OK;
}
This tokenizer assigns iStart
to each token, allowing FTS5 to calculate contiguous spans.
Workaround 2: Post-Process highlight() Output
Use SQLite’s REGEXP
extension or application-side logic to merge adjacent annotations.
Example regex replacement:
SELECT
REGEX_REPLACE(
highlight(ft2, 0, '[', ']'),
'\]\[',
''
) AS merged_highlight
FROM ft2;
This replaces "][" with an empty string, merging adjacent spans.
Step 3: Query Optimization to Minimize Fragmentation
Adjust the FTS5 query to reduce the likelihood of overlapping matches.
Technique 1: Phrase Queries
Use phrase queries to enforce token adjacency:
SELECT highlight(ft2, 0, '[', ']')
FROM ft2
WHERE ft2 MATCH '"abc cde"'; -- Matches contiguous "abc" followed by "cde"
This reduces fragmented highlights but requires knowing the exact token sequence.
Technique 2: NEAR Operators
The NEAR
operator ensures tokens appear within a specified distance:
SELECT highlight(ft2, 0, '[', ']')
FROM ft2
WHERE ft2 MATCH 'abc NEAR/0 cde'; -- Ensures "abc" and "cde" are adjacent
Step 4: Leveraging snippet() with Custom Start/End Markers
The snippet()
function allows specifying start/end markers and focus keywords. Use this to create larger context windows that encapsulate fragmented highlights.
Example:
SELECT snippet(
ft2,
0,
'[',
']',
'...',
64,
-15 -- Focus offset to center on first match
)
FROM ft2
WHERE ft2 MATCH 'abc AND cde';
This generates a 64-character snippet centered on the first match, reducing visual clutter from multiple annotations.
Step 5: Combining highlight() with Offsets
Use the offsets()
auxiliary function to manually calculate highlight ranges.
Example:
SELECT
offsets(ft2) AS off,
highlight(ft2, 0, '[', ']') AS hl
FROM ft2
WHERE ft2 MATCH 'abc AND cde';
The offsets()
output provides token positions and source text offsets, which can be parsed programmatically to merge overlapping spans.
Final Considerations
The optimal approach depends on SQLite version and application constraints:
- Upgrading SQLite is the simplest solution if possible.
- Custom tokenizers offer precise control but require C programming.
- Post-processing is portable but may impact performance.
- Query optimization balances usability and relevance.
By understanding the interplay between tokenization, positional metadata, and highlighting algorithms, developers can tailor solutions to their specific needs.