FTS5 Snippet Token Limit Exceeds Documented Maximum in Trigram Use Case
The SQLite FTS5 snippet() function is designed to extract contextual text fragments containing search matches, with a documented limit of 64 tokens on the length of the returned text. This limit is stated explicitly in the official documentation. However, empirical testing reveals that when using tokenizers like the trigram tokenizer (which splits text into overlapping 3-character sequences), the function accepts values beyond 64 without error and produces valid results. This contradicts the documentation’s assertion that the token-count argument "must be greater than zero and equal to or less than 64." The anomaly raises the question of whether the tokenizer implementation affects how the token limit is enforced or interpreted.
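As a minimal reproduction (the table name and sample content are hypothetical), a trigram-tokenized table accepts a token limit above 64 without complaint:
-- Hypothetical trigram-indexed table
CREATE VIRTUAL TABLE fts_demo USING fts5(content, tokenize = 'trigram');
INSERT INTO fts_demo(content) VALUES ('European Hake is a commercially important fish.');
-- 70 exceeds the documented maximum of 64, yet a snippet is returned without error
SELECT snippet(fts_demo, 0, '<b>', '</b>', '…', 70)
FROM fts_demo
WHERE fts_demo MATCH 'hake';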
The trigram tokenizer’s output fundamentally alters the relationship between visible text and the concept of a "token." For example, the four-character word "Hake" becomes two tokens, "hak" and "ake" (the trigram tokenizer also folds case by default). In contrast, the default unicode61 tokenizer would treat "Hake" as a single token. When a user requests 70 tokens with a trigram tokenizer, they effectively request only about 72 characters of text, because consecutive trigrams overlap and each additional token extends the covered text by a single character; 64 tokens with a word-based tokenizer, by contrast, can represent 64 whole words. This discrepancy means the token limit parameter behaves very differently depending on the tokenizer’s granularity, even though the documentation does not explicitly acknowledge this nuance.
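To inspect the tokens the trigram tokenizer actually emits, an fts5vocab table over the hypothetical fts_demo index from the sketch above can be queried:
-- List the indexed trigram terms
CREATE VIRTUAL TABLE fts_demo_vocab USING fts5vocab(fts_demo, 'row');
SELECT term FROM fts_demo_vocab;
-- 'hak' and 'ake' appear among the terms for the word "Hake"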
Root Causes of Token Limit Flexibility in FTS5 Snippet Function
Tokenizer Granularity Mismatch
The FTS5 snippet function’s token counting operates on the abstract tokens produced by the tokenizer. Tokenizers like trigram produce many small tokens per word, while others generate fewer, larger tokens. The documentation’s token limit assumes a typical word-based tokenizer, which creates ambiguity when applied to tokenizers with finer granularity. For instance, a 64-token limit yields a snippet of only about 66 characters with the trigram tokenizer but several hundred characters with unicode61. The documentation does not clarify whether the limit refers to tokens as defined by the tokenizer or to some character-based equivalent.
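A rough way to see the difference (table names and sample text are illustrative) is to index the same text under both tokenizers and compare the character length of a 64-token snippet:
CREATE VIRTUAL TABLE demo_tri  USING fts5(content, tokenize = 'trigram');
CREATE VIRTUAL TABLE demo_word USING fts5(content, tokenize = 'unicode61');
INSERT INTO demo_tri(content)  VALUES ('A few sentences of sample text in which hake is mentioned somewhere in the middle ...');
INSERT INTO demo_word(content) VALUES ('A few sentences of sample text in which hake is mentioned somewhere in the middle ...');
-- Same 64-token limit, very different snippet lengths
SELECT length(snippet(demo_tri, 0, '', '', '…', 64)) AS trigram_chars FROM demo_tri WHERE demo_tri MATCH 'hake';
SELECT length(snippet(demo_word, 0, '', '', '…', 64)) AS unicode61_chars FROM demo_word WHERE demo_word MATCH 'hake';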
Undocumented Soft Enforcement of Token Limits
SQLite’s internal implementation of the snippet function may enforce the 64-token limit as a guideline rather than a strict rule. Testing shows that values up to 70 work without errors, suggesting either headroom in the token allocation logic or a documentation oversight. The limit might be a recommendation to prevent excessively large snippets, while actual enforcement varies with tokenizer output or internal buffer sizes.
Misinterpretation of Token Counting in Snippet Assembly
The snippet function prioritizes including all match-related tokens, which may override the token limit when necessary. For example, if a match spans multiple tokens, the function could expand the snippet beyond the requested limit to ensure the entire match is visible. This behavior is not documented but aligns with observed results where larger token limits produce more coherent snippets in trigram-based searches.
Resolving Token Limit Conflicts and Optimizing Snippet Results
Step 1: Validate Tokenizer-Specific Token Limits
Test the snippet function with your tokenizer to determine practical limits. For trigram tokenizers, incrementally increase the token limit until the returned snippet stabilizes:
-- Test snippet token limits with trigram tokenizer
SELECT
snippet(fts_table, 0, '<b>', '</b>', '…', 70)
FROM fts_table
WHERE fts_table MATCH 'hake';
Compare results at 64, 70, and higher values. If no errors occur and the snippets remain coherent, your SQLite build accepts the larger limit with that tokenizer.
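A query along these lines (using the same hypothetical fts_table) reports the snippet length at several limits side by side, which makes it easy to spot where the output stops growing:
-- Compare snippet lengths at several token limits in a single pass
SELECT
    length(snippet(fts_table, 0, '', '', '…', 64)) AS len_64,
    length(snippet(fts_table, 0, '', '', '…', 70)) AS len_70,
    length(snippet(fts_table, 0, '', '', '…', 80)) AS len_80
FROM fts_table
WHERE fts_table MATCH 'hake'
LIMIT 5;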
Step 2: Adjust Token Limits Based on Tokenizer Output Size
Calculate the effective character limit for your tokenizer. For trigrams:
Effective characters ≈ token limit + 2
Because consecutive trigrams overlap by two characters, each additional token extends the covered text by only one character, so a 70-token limit with trigrams covers only about 72 characters of contiguous text. For word-based tokenizers, assume an average of roughly 6 characters per token (including spaces), so 64 tokens corresponds to several hundred characters. Adjust the token limit to align with your desired snippet length.
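As a back-of-the-envelope check (pure arithmetic, no FTS table involved), the token budgets needed for a roughly 200-character snippet differ sharply between the two models:
-- Tokens needed for a ~200-character snippet under each model
SELECT
    200 - 2 AS trigram_tokens_needed,   -- trigram: characters ≈ tokens + 2, so ~198 tokens
    200 / 6 AS word_tokens_approx;      -- unicode61: ~6 characters per token, so ~33 tokens
The trigram figure lands far above the documented 64-token cap, which is exactly why the limit becomes a practical problem for trigram-based search.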
Step 3: Monitor SQLite Updates and Documentation Revisions
The discrepancy between observed behavior and documentation may stem from legacy constraints that no longer apply. Review SQLite’s changelog for FTS5 updates, particularly around token limits. If the limit is enforced in newer versions, conditionally apply token limits based on the SQLite version:
-- Check SQLite version
SELECT sqlite_version();

-- Use 64 tokens for versions >= 3.35.0, else 70
SELECT
    snippet(fts_table, 0, '<b>', '</b>', '…',
        CASE
            WHEN CAST(substr(sqlite_version(), 1, 4) AS REAL) >= 3.35 THEN 64
            ELSE 70
        END
    )
FROM fts_table
WHERE fts_table MATCH 'hake';  -- snippet() must run inside a full-text query
Step 4: Implement Application-Level Snippet Truncation
If consistency across tokenizers is critical, use a lower token limit and truncate snippets programmatically. For example, request 64 tokens but trim the result to 200 characters:
# Python example: request 64 tokens, then truncate the snippet to 200 characters
sql = "SELECT snippet(fts_table, 0, '<b>', '</b>', '…', 64) FROM fts_table WHERE fts_table MATCH ?"
snippet = cursor.execute(sql, ('hake',)).fetchone()[0]
# Note: a plain character slice may cut through the <b>...</b> markup
truncated_snippet = (snippet[:200] + '…') if len(snippet) > 200 else snippet
Step 5: Redesign Tokenizer Strategy for Balanced Snippet Granularity
If trigram tokenization forces impractical token limits, consider a hybrid strategy. Each FTS5 table uses exactly one tokenizer, so one practical sketch (table names are illustrative) is to keep a word-tokenized table for snippet generation alongside a trigram-tokenized table for substring-style matching:
-- Word-tokenized table used for snippet generation
CREATE VIRTUAL TABLE fts_words USING fts5(content, tokenize = 'unicode61');

-- Trigram-tokenized table used for substring-style matching
CREATE VIRTUAL TABLE fts_trigram USING fts5(content, tokenize = 'trigram');
This two-table approach allows word-based snippet limits for display while retaining trigram search capabilities for complex queries.
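Assuming the hypothetical fts_words and fts_trigram tables above are populated from the same source rows (so their rowids line up), a combined query can use the trigram index for substring-style matching while the word index drives snippet generation:
-- Substring-style filter via trigrams, word-based snippet for display
SELECT snippet(fts_words, 0, '<b>', '</b>', '…', 64)
FROM fts_words
WHERE rowid IN (SELECT rowid FROM fts_trigram WHERE fts_trigram MATCH 'ake')
  AND fts_words MATCH 'hake';
The word-level MATCH on fts_words is what snippet() highlights; the trigram subquery only narrows the candidate rows.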
Final Recommendation
The practical effect of the FTS5 snippet function’s token limit depends on the tokenizer, and its enforcement appears looser than the documentation suggests. For trigram tokenizers, values up to 70 tokens have proven empirically stable, but monitor SQLite releases for changes. Always benchmark snippet lengths against your data’s token distribution to ensure usability.