Overlapping Highlight Ranges in SQLite FTS5 Trigram Tokenization

Understanding the Behavior of highlight() and snippet() with Trigram Tokenization in FTS5

The highlight() and snippet() functions in SQLite’s Full-Text Search (FTS5) extension are designed to return annotated portions of text that match a query. These annotations help users quickly identify relevant segments in search results. However, when using the trigram tokenizer, these functions may produce unexpected results when adjacent or overlapping matches occur. Specifically, instead of merging contiguous or overlapping matched regions into a single highlighted span, the functions may generate multiple fragmented annotations.

For example, consider a virtual table created with tokenize="trigram" containing the text "abcde". When searching for the terms "abc" AND "cde", the highlight() function returns [abc]de[cde] instead of a unified [abcde]. This occurs because the trigram tokenizer splits the text into overlapping 3-character sequences (e.g., "abc", "bcd", "cde"), and the FTS5 engine does not automatically consolidate adjacent matches by default. The result is fragmented highlighting, which diminishes readability and utility.

This behavior contrasts with FTS5’s default tokenizer, which splits text on whitespace and punctuation. In the default mode, overlapping matches are more likely to be merged into a single span because tokens are discrete and non-overlapping. The trigram tokenizer’s design inherently creates overlapping tokens, leading to edge cases where the highlighting logic struggles to determine contiguous ranges.

Root Causes of Fragmented Highlighting in Trigram Tokenizer Mode

1. Tokenizer-Generated Overlapping Spans

The trigram tokenizer divides text into sliding 3-character windows. For the string "abcde", this produces tokens "abc", "bcd", and "cde". When a query matches multiple adjacent tokens (e.g., "abc" and "cde"), the FTS5 engine registers these as separate matches. The highlight() function annotates each token individually unless explicitly instructed to merge adjacent spans.

In the example above, "abc" and "cde" are non-overlapping tokens but are adjacent in the original text. However, because the tokenizer does not provide positional metadata indicating that "abc" ends at offset 3 and "cde" starts at offset 2 (due to the sliding window), FTS5 cannot infer that these tokens form a contiguous block in the source text.

2. Lack of Span Merging Logic for Overlapping Tokens

The FTS5 highlight() function uses a greedy algorithm to merge overlapping or adjacent matches. This works well for non-overlapping tokens (e.g., "a b c d e") but fails when tokens share characters. The trigram tokenizer’s output creates ambiguity: the same character may belong to multiple tokens, making it non-trivial to determine where one span ends and another begins.

For instance, in "abcde", the token "bcd" overlaps with both "abc" and "cde". If the query matches "abc" and "cde", the highlighting logic must decide whether to include "bcd" in the annotated range. The absence of context about token adjacency in the source text leads to保守的 (conservative) annotation, resulting in multiple brackets.

3. Positional Metadata Limitations in Trigram Tokenization

FTS5 relies on token positions to calculate highlight ranges. The trigram tokenizer assigns positions based on token occurrence, not their original text offsets. For example, in "abc x cde", the tokens "abc", "bcx", "c x", "x c", "cde" are assigned positions 1, 2, 3, 4, 5. When searching for "abc" (position 1) and "cde" (position 5), the gap between positions 1 and 5 suggests non-contiguous tokens, even though "abc" and "cde" are adjacent in the source text.

This positional discontinuity prevents the highlight() function from recognizing that the matched tokens form a contiguous block. The result is fragmented highlighting, as the function annotates each token separately.

Resolving Overlapping Highlights in Trigram Tokenizer Mode

Step 1: Verify SQLite Version and Apply the Fix

The initial step is to confirm whether the SQLite version includes the e952db86faaafd2e commit, which addresses this issue. This fix modifies how FTS5 handles positional data for trigram tokens, enabling the highlight() function to merge adjacent matches.

Action:

Check the SQLite version:
```
SELECT sqlite_version();
```
Versions prior to 3.45.0 (released after October 2023) may not include the fix.
If using an older version, update SQLite or apply the patch from the source repository.

Step 2: Custom Tokenizer Configuration for Span Merging

If updating SQLite is not feasible, implement a workaround by modifying tokenizer behavior or post-processing highlight results.

Workaround 1: Custom Trigram Tokenizer with Positional Metadata
Create a custom tokenizer that emits positional metadata reflecting the original text offsets. This requires writing a C extension, as SQLite’s built-in trigram tokenizer does not expose this data.

Example pseudocode for a custom tokenizer:

static int xTokenize(
  void *pCtx, 
  const char *pText, 
  int nText, 
  Fts5Token *pToken
){
  for(int i=0; i<=nText-3; i++){
    pToken->iStart = i;  // Start offset in source text
    pToken->n = 3;       // Token length
    // Emit token
  }
  return SQLITE_OK;
}

This tokenizer assigns iStart to each token, allowing FTS5 to calculate contiguous spans.

Workaround 2: Post-Process highlight() Output
Use SQLite’s REGEXP extension or application-side logic to merge adjacent annotations.

Example regex replacement:

SELECT 
  REGEX_REPLACE(
    highlight(ft2, 0, '[', ']'), 
    '\]\[', 
    ''
  ) AS merged_highlight 
FROM ft2;

This replaces "][" with an empty string, merging adjacent spans.

Step 3: Query Optimization to Minimize Fragmentation

Adjust the FTS5 query to reduce the likelihood of overlapping matches.

Technique 1: Phrase Queries
Use phrase queries to enforce token adjacency:

SELECT highlight(ft2, 0, '[', ']') 
FROM ft2 
WHERE ft2 MATCH '"abc cde"';  -- Matches contiguous "abc" followed by "cde"

This reduces fragmented highlights but requires knowing the exact token sequence.

Technique 2: NEAR Operators
The NEAR operator ensures tokens appear within a specified distance:

SELECT highlight(ft2, 0, '[', ']') 
FROM ft2 
WHERE ft2 MATCH 'abc NEAR/0 cde';  -- Ensures "abc" and "cde" are adjacent

Step 4: Leveraging snippet() with Custom Start/End Markers

The snippet() function allows specifying start/end markers and focus keywords. Use this to create larger context windows that encapsulate fragmented highlights.

Example:

SELECT snippet(
  ft2, 
  0, 
  '[', 
  ']', 
  '...', 
  64, 
  -15  -- Focus offset to center on first match
) 
FROM ft2 
WHERE ft2 MATCH 'abc AND cde';

This generates a 64-character snippet centered on the first match, reducing visual clutter from multiple annotations.

Step 5: Combining highlight() with Offsets

Use the offsets() auxiliary function to manually calculate highlight ranges.

Example:

SELECT 
  offsets(ft2) AS off,
  highlight(ft2, 0, '[', ']') AS hl 
FROM ft2 
WHERE ft2 MATCH 'abc AND cde';

The offsets() output provides token positions and source text offsets, which can be parsed programmatically to merge overlapping spans.

Final Considerations

The optimal approach depends on SQLite version and application constraints:

Upgrading SQLite is the simplest solution if possible.
Custom tokenizers offer precise control but require C programming.
Post-processing is portable but may impact performance.
Query optimization balances usability and relevance.

By understanding the interplay between tokenization, positional metadata, and highlighting algorithms, developers can tailor solutions to their specific needs.

Overlapping Highlight Ranges in SQLite FTS5 Trigram Tokenization

Understanding the Behavior of highlight() and snippet() with Trigram Tokenization in FTS5

Root Causes of Fragmented Highlighting in Trigram Tokenizer Mode

1. Tokenizer-Generated Overlapping Spans

2. Lack of Span Merging Logic for Overlapping Tokens

3. Positional Metadata Limitations in Trigram Tokenization

Resolving Overlapping Highlights in Trigram Tokenizer Mode

Step 1: Verify SQLite Version and Apply the Fix

Step 2: Custom Tokenizer Configuration for Span Merging

Step 3: Query Optimization to Minimize Fragmentation

Step 4: Leveraging snippet() with Custom Start/End Markers

Step 5: Combining highlight() with Offsets

Final Considerations

JSON Validation Inconsistencies in SQLite CLI: Debugging json_valid() and File Handling

SQLite CHECK Constraints and Date Validation Techniques

SQLite hex() Function Behavior and Correct Decimal to Hexadecimal Conversion

SQLite NOT NULL Constraint Optimization Issue on Corrupted Databases

Adding Dynamic Years to Date in SQLite Using Column Values

Using Subquery Results for ORDER BY in SQLite Queries

Leave a Reply Cancel reply

Understanding the Behavior of highlight() and snippet() with Trigram Tokenization in FTS5

Root Causes of Fragmented Highlighting in Trigram Tokenizer Mode

1. Tokenizer-Generated Overlapping Spans

2. Lack of Span Merging Logic for Overlapping Tokens

3. Positional Metadata Limitations in Trigram Tokenization

Resolving Overlapping Highlights in Trigram Tokenizer Mode

Step 1: Verify SQLite Version and Apply the Fix

Step 2: Custom Tokenizer Configuration for Span Merging

Step 3: Query Optimization to Minimize Fragmentation

Step 4: Leveraging snippet() with Custom Start/End Markers

Step 5: Combining highlight() with Offsets

Final Considerations

Related Guides

Leave a Reply Cancel reply