FTS5 Trigram Tokenizer remove_diacritics Option Causing Runtime Error in SQLite 3.44.2
Issue Overview: Runtime Error During FTS5 Trigram Tokenizer Initialization with remove_diacritics
The core issue involves a runtime error triggered when attempting to create a virtual table using SQLite’s FTS5 extension with the trigram tokenizer and the remove_diacritics option. The error message "Runtime error: error in tokenizer constructor" indicates a failure during the initialization of the tokenizer module. This problem arises specifically when combining the trigram tokenizer with the remove_diacritics parameter in SQLite versions prior to 3.45.
Technical Context of FTS5 Trigram Tokenizer and Diacritic Removal
FTS5 (Full-Text Search version 5) is SQLite’s full-text search engine. The trigram tokenizer splits text into overlapping sequences of three consecutive characters (trigrams), which enables efficient substring matching. The remove_diacritics option normalizes text by stripping diacritical marks (e.g., converting "é" to "e"), so that a search for "cafe" also matches "café". When enabled, the option folds characters as text is tokenized, for indexed text and query text alike, so diacritics do not interfere with search results.
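To make the two ideas concrete, here is a minimal Python sketch of trigram splitting and diacritic stripping. It only illustrates the concepts; it is not how FTS5 implements them internally, and the helper names strip_diacritics and trigrams are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def trigrams(text: str) -> list:
    """Return every run of three consecutive characters in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


sample = "a\u00e0bcd"  # "aàbcd", written with an escape to pin the code point
print(trigrams(sample))                    # ['aàb', 'àbc', 'bcd']
print(trigrams(strip_diacritics(sample)))  # ['aab', 'abc', 'bcd']
```

Without stripping, the trigram "aàb" never matches a query for "aab", which is the mismatch the remove_diacritics option is meant to eliminate.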
Error Reproduction Scenario
The error occurs under the following conditions:
- SQLite Version: 3.44.2 or earlier.
- Virtual Table Creation Command:
CREATE VIRTUAL TABLE fts5_T USING fts5(a, content='T', tokenize="trigram remove_diacritics 1");
- Input Data: Text containing diacritics (e.g., "aàbcdeéfghij KLMNOPQRST uvwxyz").
The error manifests while the tokenize parameter string is parsed, that is, when CREATE VIRTUAL TABLE runs and before any data is indexed. The tokenizer constructor does not recognize the remove_diacritics option, so creation of the virtual table fails immediately.
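The scenario above can be reproduced from Python's sqlite3 module with a sketch like the following. The function name try_create_fts5 is hypothetical; the table and column names follow the report. The result depends on the SQLite library that your Python build links against, so no particular output is assumed here.

```python
import sqlite3
from typing import Optional


def try_create_fts5(tokenize: str) -> Optional[str]:
    """Try to create an FTS5 table with the given tokenize string.

    Returns None on success, or the SQLite error message on failure.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE T (a TEXT)")
        conn.execute(
            "CREATE VIRTUAL TABLE fts5_T USING fts5("
            f"a, content='T', tokenize='{tokenize}')"
        )
        return None
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        conn.close()


print("SQLite library version:", sqlite3.sqlite_version)
print("Result:", try_create_fts5("trigram remove_diacritics 1"))
```

On a library older than 3.45 the second line should report the tokenizer constructor error; on a build without FTS5 it reports a missing module instead.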
Underlying Technical Failure
The trigram tokenizer in SQLite versions before 3.45 does not include support for the remove_diacritics option. The tokenizer initialization routine accepts a fixed set of parameters, and any unrecognized option (such as remove_diacritics in older versions) results in a constructor error. This is a version-specific limitation tied to the implementation timeline of FTS5 features.
Possible Causes: Version Mismatch, Syntax Misconfiguration, and Build Options
Cause 1: SQLite Version Incompatibility with remove_diacritics
The remove_diacritics option for the trigram tokenizer was not available in SQLite 3.44.2 or earlier releases. This feature was introduced in subsequent development builds and is scheduled for official release in SQLite 3.45. The 3.44.x series is a patch branch that includes only critical bug fixes and no new features. Attempting to use remove_diacritics in 3.44.2 will always fail because the code to parse this option is absent from the tokenizer module.
Cause 2: Incorrect Tokenizer Parameter Syntax
While the primary cause is version incompatibility, syntax errors in the tokenize parameter string can also trigger similar runtime errors. The trigram tokenizer expects parameters in the format:
tokenize="trigram [option1] [option2] ..."
where valid options include case_sensitive and remove_diacritics, each followed by an integer (0 or 1) to disable or enable the feature. A misplaced space, missing integer argument, or misspelled option name (e.g., "remove_diacritic" instead of "remove_diacritics") could cause the tokenizer constructor to fail.
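A small pre-flight check can catch these mistakes before the string reaches SQLite. The sketch below is a hypothetical helper, not SQLite's own parser: it only encodes the two option names and the 0/1 values described above, and a string that passes can still be rejected by an older SQLite version.

```python
KNOWN_OPTIONS = {"case_sensitive", "remove_diacritics"}


def check_trigram_args(tokenize: str) -> list:
    """Return a list of problems found in a trigram tokenize string."""
    parts = tokenize.split()
    if not parts or parts[0] != "trigram":
        return ["string must start with 'trigram'"]
    args = parts[1:]
    problems = []
    # Options come in name/value pairs, so an odd count means a value is missing.
    if len(args) % 2 != 0:
        problems.append("every option needs a 0 or 1 value")
    for name, value in zip(args[0::2], args[1::2]):
        if name not in KNOWN_OPTIONS:
            problems.append(f"unknown option: {name}")
        elif value not in ("0", "1"):
            problems.append(f"{name} must be 0 or 1, got {value}")
    return problems


print(check_trigram_args("trigram remove_diacritic 1"))
```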
Cause 3: FTS5 Extension Not Built with Trigram Support
Although the error message explicitly references the tokenizer constructor (indicating that FTS5 is enabled), a custom SQLite build could in theory lack trigram tokenizer support. The trigram tokenizer is part of the FTS5 extension, which is compiled in with the SQLITE_ENABLE_FTS5 compile-time flag, and it was added in SQLite 3.34.0. If FTS5 is enabled but the trigram tokenizer is not included, any attempt to use tokenize="trigram" would fail. However, this is unlikely in standard builds or precompiled binaries.
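One way to rule this cause out is to ask the library which compile-time options it was built with. The sketch below assumes the build reports its options through PRAGMA compile_options (names are listed without the SQLITE_ prefix); the function name fts5_compiled_in is hypothetical. A False result is not conclusive, since FTS5 can also be loaded as an extension and some builds omit the diagnostics.

```python
import sqlite3


def fts5_compiled_in(conn: sqlite3.Connection) -> bool:
    """Return True if the library reports ENABLE_FTS5 among its compile options."""
    options = {row[0] for row in conn.execute("PRAGMA compile_options")}
    return "ENABLE_FTS5" in options


conn = sqlite3.connect(":memory:")
print("FTS5 compiled in:", fts5_compiled_in(conn))
conn.close()
```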
Troubleshooting Steps, Solutions & Fixes: Version Upgrades, Workarounds, and Custom Builds
Step 1: Verify SQLite Version and Feature Availability
Action: Confirm the SQLite version and whether it includes the remove_diacritics option for the trigram tokenizer.
Command:
SELECT sqlite_version();
Expected Output: A version string of 3.45.0 or higher. For versions ≤3.44.2, the remove_diacritics option is unavailable.
Resolution: Upgrade to SQLite 3.45 or later once released. Pre-release builds containing the feature can be compiled from the SQLite source code repository.
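The same check can be made from application code. This sketch compares the version of the SQLite library that Python's sqlite3 module is linked against with the 3.45.0 threshold named above; the function name is hypothetical.

```python
import sqlite3

MIN_VERSION = (3, 45, 0)  # first release expected to accept the option, per this article


def supports_trigram_remove_diacritics(version=sqlite3.sqlite_version_info) -> bool:
    """Return True if the given SQLite version tuple meets the threshold."""
    return tuple(version) >= MIN_VERSION


print(sqlite3.sqlite_version, supports_trigram_remove_diacritics())
```

Note that sqlite3.sqlite_version_info describes the linked library, which may differ from the sqlite3 command-line shell installed on the same machine.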
Step 2: Validate Tokenizer Parameter Syntax
Action: Ensure the tokenize parameter string follows the exact syntax required by the trigram tokenizer.
Example of Correct Syntax:
CREATE VIRTUAL TABLE fts5_T USING fts5(
    a,
    content='T',
    tokenize="trigram remove_diacritics 1 case_sensitive 0"
);
Key Points:
- Options are space-separated.
- Each option (remove_diacritics, case_sensitive) is followed by 0 (disable) or 1 (enable).
- The order of options does not matter.
Step 3: Compile SQLite from Source with Latest FTS5 Patches
Action: Build SQLite from source, integrating the commit that implements remove_diacritics for the trigram tokenizer.
Required Commit: 0d50172477064dce (implements remove_diacritics).
Compilation Steps:
- Download the SQLite source code:
fossil clone https://www.sqlite.org/cgi/src sqlite.fossil
mkdir sqlite
cd sqlite
fossil open ../sqlite.fossil
- Update to the specific commit:
fossil update 0d50172477
- Configure and build with FTS5 enabled:
./configure --enable-fts5
make
- Replace the system SQLite binary with the newly built version or use it explicitly in your application.
Step 4: Implement Manual Diacritic Removal as a Workaround
Action: Preprocess text to remove diacritics before inserting it into the FTS5 table. This can be achieved via SQL functions or application-layer logic.
Example Using SQLite User-Defined Function (UDF):
- Register a UDF to Remove Diacritics (e.g., in Python using sqlite3):
import unicodedata
import sqlite3

def remove_diacritics(text):
    # Decompose characters (NFKD), then drop the combining marks.
    return ''.join(
        c for c in unicodedata.normalize('NFKD', text)
        if not unicodedata.combining(c)
    ) if text else text

conn = sqlite3.connect('base_test.db')
conn.create_function('remove_diacritics', 1, remove_diacritics)
- Modify the FTS5 Table Creation and Data Insertion:
-- Create a regular table with preprocessed text
CREATE TABLE T (a TEXT);
INSERT INTO T VALUES(remove_diacritics('aàbcdeéfghij KLMNOPQRST uvwxyz'));
-- Create FTS5 table without remove_diacritics option
CREATE VIRTUAL TABLE fts5_T USING fts5(a, content='T', tokenize="trigram");
-- Index the rows already present in the external content table
INSERT INTO fts5_T(fts5_T) VALUES('rebuild');
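Because the index now holds stripped text, the search term has to be stripped the same way or an accented query will not match. The following end-to-end sketch runs the workaround against an in-memory database. The search function is a hypothetical wrapper, and it returns None when the linked SQLite build lacks FTS5 or the trigram tokenizer, which is an assumption made to keep the example runnable everywhere.

```python
import sqlite3
import unicodedata


def remove_diacritics(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    if not text:
        return text
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def search(term):
    """Return matching rows, or None if the FTS5 trigram setup fails."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("remove_diacritics", 1, remove_diacritics)
    try:
        conn.execute("CREATE TABLE T (a TEXT)")
        conn.execute(
            "INSERT INTO T VALUES (remove_diacritics(?))",
            ("a\u00e0bcde\u00e9fghij KLMNOPQRST uvwxyz",),
        )
        conn.execute(
            "CREATE VIRTUAL TABLE fts5_T "
            "USING fts5(a, content='T', tokenize='trigram')"
        )
        # Populate the index from the external content table.
        conn.execute("INSERT INTO fts5_T(fts5_T) VALUES ('rebuild')")
        # Strip the query too, and quote it as an FTS5 string literal.
        stripped = remove_diacritics(term).replace('"', '""')
        rows = conn.execute(
            "SELECT a FROM fts5_T WHERE fts5_T MATCH ?", (f'"{stripped}"',)
        ).fetchall()
        return [row[0] for row in rows]
    except sqlite3.OperationalError:
        return None  # e.g. FTS5 or the trigram tokenizer is unavailable
    finally:
        conn.close()


print(search("de\u00e9"))  # query "deé" is searched as "dee"
```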
Step 5: Utilize Triggers for Automatic Diacritic Removal
Action: Create database triggers to automatically process text before insertion into the underlying content table.
Trigger Definition:
-- Create a table that receives the raw, unprocessed text
CREATE TABLE T_raw (a TEXT);
-- Create a trigger to preprocess text on insertion
-- (remove_diacritics must be registered on the connection that fires it)
CREATE TRIGGER T_preprocess BEFORE INSERT ON T_raw
BEGIN
    INSERT INTO T VALUES(remove_diacritics(NEW.a));
END;
-- Insert data into the raw table
INSERT INTO T_raw VALUES('aàbcdeéfghij KLMNOPQRST uvwxyz');
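The trigger depends on the remove_diacritics function being registered on whichever connection performs the insert. The sketch below shows the trigger path from Python with an in-memory database, reusing the table and trigger names from the definition above; it leaves out the FTS5 table so that it runs on any SQLite build.

```python
import sqlite3
import unicodedata


def remove_diacritics(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    if not text:
        return text
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


conn = sqlite3.connect(":memory:")
# Register the UDF before any statement fires the trigger.
conn.create_function("remove_diacritics", 1, remove_diacritics)
conn.executescript(
    """
    CREATE TABLE T (a TEXT);
    CREATE TABLE T_raw (a TEXT);
    CREATE TRIGGER T_preprocess BEFORE INSERT ON T_raw
    BEGIN
        INSERT INTO T VALUES (remove_diacritics(NEW.a));
    END;
    """
)
conn.execute("INSERT INTO T_raw VALUES (?)", ("a\u00e0bcde\u00e9fghij",))

raw = conn.execute("SELECT a FROM T_raw").fetchone()[0]
stored = conn.execute("SELECT a FROM T").fetchone()[0]
print(raw, "->", stored)
conn.close()
```

T_raw keeps the original spelling for display, while T holds the stripped form that the FTS5 table indexes.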
Step 6: Monitor SQLite Release Channels for 3.45 Availability
Action: Track SQLite’s official release announcements to upgrade promptly when version 3.45 becomes available.
Resources:
- SQLite website: https://sqlite.org/index.html
- Mailing list: sqlite-announce
Final Recommendation
For production systems requiring diacritic-insensitive trigram searches, the optimal solution is to wait for SQLite 3.45 and use the built-in remove_diacritics option. Until then, manual preprocessing via UDFs or triggers provides a functional workaround. Developers testing pre-release builds should validate them thoroughly in non-production environments to avoid stability issues.