Retrieving FTS5 Locale Data for Tokenization and Content Reconstruction
Understanding the FTS5 Locale Retrieval Limitation in Tokenization Workflows
The core challenge arises when attempting to map tokens generated by SQLite’s FTS5 full-text search engine back to their original source text. FTS5 tokenization applies transformations such as case folding, accent removal, stemming, and compatibility normalization, which often render tokens unrecognizable compared to the original human-readable content. For applications requiring accurate reconstruction of source text from tokens—such as generating "Did you mean?" suggestions, auditing search behavior, or debugging tokenization rules—the ability to trace tokens to their source text is critical.
A critical dependency in this process is the locale used during tokenization. The locale determines language-specific rules for transformations (e.g., stemming in English vs. French) and character normalization. When a locale is specified for an FTS5 table column, it influences how content is tokenized and stored. However, prior to the introduction of the `fts5_locale` enhancement, there was no built-in mechanism to retrieve the locale associated with a specific row or column outside the context of a query. This limitation becomes acute when attempting to re-tokenize content programmatically with the correct locale settings in order to align byte offsets with the original text.
For example, consider an FTS5 table with columns tokenized using different locales. In an English-locale column, the stemmer reduces the source word "running" to the index token "run"; in a German-locale column, "running" would be indexed essentially unchanged, because the two stemmers apply different rules. If the locale is not retrievable, re-tokenizing content to identify the source text segments that correspond to each token will produce incorrect byte offsets, leading to inaccurate text extraction. The problem is exacerbated in systems where locales vary dynamically per row or column, as manually tracking locale assignments is error-prone and impractical.
Why Locale Metadata Isn’t Directly Accessible in FTS5
The absence of a direct method to retrieve locale information outside query execution stems from FTS5's architectural design and the tokenizer API's historical limitations. Before the `fts5_locale` integration, tokenizers operated without explicit locale context, relying on global settings or implicit assumptions. The introduction of per-row/per-column locale storage was a significant enhancement but initially lacked a public interface for external access. This created a gap between the internal storage of locale data and its availability for use in auxiliary processes like content reconstruction.
Auxiliary functions such as `snippet()` and `highlight()` internally re-tokenize content using the stored locale to generate contextually accurate results. However, these functions operate within the query execution context, where the locale is implicitly available. External tools or procedures that require locale data, such as batch analysis scripts or diagnostic utilities, cannot leverage this internal mechanism. The FTS5 extension API prior to `fts5_locale` did not expose methods to query locale metadata, leaving developers to rely on workarounds like duplicating locale settings in external metadata tables, which introduces synchronization risks and complexity.
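For context, the query-time nature of these auxiliary functions is easy to see in host-language code. The sketch below uses Python's built-in sqlite3 module and a plain FTS5 table with the default tokenizer (no locale options, since the exact locale syntax depends on the trunk enhancement described below); it only illustrates that `highlight()` is callable solely inside a MATCH query, which is the context where FTS5 can consult the row's stored state.

```python
import sqlite3

# Requires an SQLite build with FTS5 compiled in (true of most Python builds).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(title, body)")
conn.execute("INSERT INTO notes VALUES ('intro', 'Running quickly matters.')")

# highlight() is an FTS5 auxiliary function: it exists only while a
# full-text query is executing against the table it decorates.
row = conn.execute(
    "SELECT highlight(notes, 1, '[', ']') "
    "FROM notes WHERE notes MATCH 'running'"
).fetchone()
print(row[0])  # -> "[Running] quickly matters."
```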
The problem is further complicated by the tokenizer versioning system. Tokenizers using the v2 API can accept locale parameters, but older tokenizers (v0/v1) do not support this. Applications aiming for backward compatibility must handle cases where locale data is absent or inconsistently applied. Without a unified method to retrieve locale settings, developers face fragmented logic when reconciling tokens with source text across heterogeneous tokenizer configurations.
Implementing fts5_get_locale for Accurate Tokenization and Content Reconstruction
The solution to this challenge is the `fts5_get_locale` auxiliary function introduced in SQLite trunk. This function allows explicit retrieval of the locale associated with a specific row and column in an FTS5 table. Its syntax is designed to integrate seamlessly with SQL queries while avoiding ambiguity:
```sql
SELECT fts5_get_locale(fts5_table_name, column_identifier, rowid) FROM fts5_table WHERE ...;
```
Parameters:
- `fts5_table_name`: The name of the FTS5 virtual table.
- `column_identifier`: A column name (string) or index (integer).
- `rowid`: The rowid of the target record.
Return Value:
- A string representing the locale (e.g., "en_US", "fr_CA") if specified during tokenization.
- `NULL` if no locale was set for the column or row.
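In host-language code, the NULL case can be folded into a small wrapper. The sketch below uses Python's sqlite3 module and assumes a build that provides the `fts5_get_locale()` SQL function with the signature shown above; the helper name and fallback argument are illustrative only.

```python
import sqlite3
from typing import Optional

def column_locale(conn: sqlite3.Connection, table: str, column: str,
                  rowid: int, fallback: Optional[str] = None) -> Optional[str]:
    """Return the stored locale for (table, column, rowid), or `fallback`
    when fts5_get_locale() reports NULL (no locale recorded for that cell)."""
    row = conn.execute(
        "SELECT fts5_get_locale(?, ?, ?)", (table, column, rowid)
    ).fetchone()
    return row[0] if row and row[0] is not None else fallback
```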
Example Workflow:
Schema Setup
Create an FTS5 table with locale assignments:

```sql
CREATE VIRTUAL TABLE docs USING fts5(
  title,
  content,
  locale='en_US',          -- Applies to all columns by default
  locale_content='fr_CA'   -- Overrides locale for 'content' column
);
```
Locale Retrieval
Fetch the locale for rowid=42 in the 'content' column:

```sql
SELECT fts5_get_locale('docs', 'content', 42) AS locale;
-- Returns 'fr_CA'
```
Batch Analysis
Identify all distinct locales used in the 'content' column:

```sql
SELECT DISTINCT fts5_get_locale('docs', 'content', rowid) FROM docs;
```
Integration with Tokenization Pipelines:
To reconstruct source text from tokens using the correct locale:
- Query the locale for the target row and column.
- Configure the tokenizer with the retrieved locale.
- Re-tokenize the source text to obtain byte offsets matching the original tokenization.
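Put together, the three steps above might look like the following Python sketch. It assumes the `docs` table defined earlier and the `fts5_get_locale()` function described in this section; `retokenize()` is a deliberately naive stand-in for a locale-aware tokenizer (a real pipeline must plug in a tokenizer that mirrors the FTS5 configuration, for example an ICU-based one, or the offsets will not line up).

```python
import re
import sqlite3

def retokenize(text: str, locale: str):
    """Naive stand-in for a locale-aware tokenizer.

    Yields (token, byte_start, byte_end). The `locale` argument is accepted
    but ignored here; a real implementation would select stemming and
    normalization rules based on it.
    """
    raw = text.encode("utf-8")
    for m in re.finditer(rb"\w+", raw):
        yield m.group().decode("utf-8").lower(), m.start(), m.end()

def reconstruct_token_sources(conn: sqlite3.Connection, rowid: int):
    # Step 1: fetch the stored locale and the original text in one query.
    locale, content = conn.execute(
        "SELECT fts5_get_locale('docs', 'content', rowid), content "
        "FROM docs WHERE rowid = ?", (rowid,)
    ).fetchone()

    # Steps 2-3: re-tokenize with the retrieved locale (falling back if NULL)
    # and slice the original bytes to recover each token's source segment.
    raw = content.encode("utf-8")
    return [(token, raw[start:end].decode("utf-8"))
            for token, start, end in retokenize(content, locale or "en_US")]
```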
Handling Edge Cases:
- Legacy Tokenizers (v0/v1): If the tokenizer does not support locales, `fts5_get_locale` returns `NULL`. Applications should default to a fallback locale or notify users of the incompatibility.
- Mixed Locale Configurations: When locales vary by row, iterate through each rowid to ensure accurate per-row locale retrieval.
- Concurrency Considerations: Ensure that locale retrieval and re-tokenization occur within the same transaction to prevent data inconsistency due to concurrent updates.
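For the concurrency point, wrapping both reads in one transaction (or fetching locale and content in a single SELECT, as in the earlier sketch) guarantees they describe the same version of the row. A minimal sketch, under the same assumptions as above:

```python
import sqlite3

def locale_and_content(db_path: str, rowid: int):
    """Read the locale and the content for one row inside a single read
    transaction, so a concurrent writer cannot change the row between the
    two SELECT statements."""
    # isolation_level=None: sqlite3 stays out of transaction management,
    # so we can issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("BEGIN")
        locale = conn.execute(
            "SELECT fts5_get_locale('docs', 'content', ?)", (rowid,)
        ).fetchone()[0]
        content = conn.execute(
            "SELECT content FROM docs WHERE rowid = ?", (rowid,)
        ).fetchone()[0]
        conn.execute("COMMIT")
        return locale, content
    finally:
        conn.close()
```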
Performance Optimization:
- Indexed Rowid Queries: Use `WHERE rowid=?` clauses to leverage SQLite's rowid indexing for fast lookups.
- Caching Locales: For static datasets, cache locale values in memory to avoid repetitive queries.
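For the caching point, a per-connection memo is usually sufficient when the table is not being rewritten. A minimal sketch, assuming the same `fts5_get_locale()` function and `docs` schema used above:

```python
import sqlite3
from functools import lru_cache

def make_locale_cache(conn: sqlite3.Connection, table: str, column: str):
    """Build a memoized rowid -> locale lookup for a static dataset.

    Cached entries are never invalidated, so this is only safe while the
    underlying FTS5 table is not being modified.
    """
    @lru_cache(maxsize=None)
    def locale_for(rowid: int):
        row = conn.execute(
            "SELECT fts5_get_locale(?, ?, ?)", (table, column, rowid)
        ).fetchone()
        return row[0] if row else None

    return locale_for

# Usage (hypothetical): locale_for = make_locale_cache(conn, "docs", "content")
#                       locale_for(42)  # hits the database once, then the cache
```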
Debugging and Validation:
- Use an `fts5vocab` virtual table (for example an 'instance'-type table named `docs_vocab`, whose rows expose `term`, `doc`, and `col`) to cross-reference tokens with source text segments. For example:

```sql
SELECT v.term, fts5_get_locale('docs', v.col, v.doc) AS locale
FROM docs_vocab v
WHERE v.term = 'run';
```

- Validate byte offsets by comparing re-tokenized text (tokenized with the locale returned by `fts5_get_locale`) with the original content.
Migration Strategy for Existing Systems:
- Update SQLite to a version containing the `fts5_locale` enhancement.
- Modify FTS5 table definitions to include locale parameters where necessary.
- Refactor content reconstruction logic to use `fts5_get_locale` instead of hard-coded locale assumptions.
- Audit existing data to ensure locale metadata consistency using `DISTINCT` queries.
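The audit step might look like the following sketch, which reports the distinct locale values per column so that unexpected or missing locales stand out. It assumes the `docs` schema and the `fts5_get_locale()` function described above.

```python
import sqlite3

def audit_locales(conn: sqlite3.Connection, table: str, columns: list[str]):
    """Print the distinct locale values stored for each indexed column.

    NULL entries indicate rows written without a locale (for example, rows
    indexed under a legacy v0/v1 tokenizer configuration).
    """
    for column in columns:
        rows = conn.execute(
            f'SELECT DISTINCT fts5_get_locale(?, ?, rowid) FROM "{table}"',
            (table, column),
        ).fetchall()
        locales = sorted((r[0] or "NULL") for r in rows)
        print(f"{table}.{column}: {locales}")

# Usage (hypothetical): audit_locales(conn, "docs", ["title", "content"])
```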
By adopting `fts5_get_locale`, developers eliminate the risk of incorrect token-source alignment caused by locale mismatches. The function bridges the gap between FTS5's internal tokenization state and external diagnostic or content-reconstruction tools, ensuring robust handling of multilingual content and complex tokenization rules.