FTS5 Trigram Tokenizer remove_diacritics Option Causing Runtime Error in SQLite 3.44.2

Issue Overview: Runtime Error During FTS5 Trigram Tokenizer Initialization with remove_diacritics

The core issue is a runtime error raised when creating a virtual table with SQLite’s FTS5 extension using the trigram tokenizer together with the remove_diacritics option. The error message "Runtime error: error in tokenizer constructor" means the tokenizer module failed to initialize. The problem arises specifically when the trigram tokenizer is combined with the remove_diacritics parameter in SQLite versions prior to 3.45.

Technical Context of FTS5 Trigram Tokenizer and Diacritic Removal

FTS5 (Full-Text Search version 5) is SQLite’s full-text search engine. The trigram tokenizer splits text into overlapping sequences of three consecutive characters (trigrams), which allows substring matching and LIKE/GLOB-style pattern queries to use the index. The remove_diacritics option normalizes text by stripping diacritical marks (e.g., converting "é" to "e"), so that a search for "cafe" also matches "café". When enabled, the option is applied as part of tokenization, to both indexed text and query text, so diacritics do not affect search results.
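To make the trigram idea concrete, here is a minimal Python sketch of splitting a string into overlapping three-character sequences. It is an illustration only, not FTS5's implementation: the helper name trigrams is hypothetical, and the real tokenizer also folds case by default.

```python
def trigrams(text: str) -> list[str]:
    """Return every run of three consecutive characters in `text`."""
    # A string of length n yields n - 2 overlapping trigrams (none if n < 3).
    return [text[i:i + 3] for i in range(len(text) - 2)]


print(trigrams("sqlite"))  # ['sql', 'qli', 'lit', 'ite']
```

Because matching is done on these three-character units, "é" and "e" produce different trigrams unless diacritics are stripped first, which is why the remove_diacritics option matters for this tokenizer.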

Error Reproduction Scenario

The error occurs under the following conditions:

  1. SQLite Version: 3.44.2 or earlier.
  2. Virtual Table Creation Command:
    CREATE VIRTUAL TABLE fts5_T USING fts5(a, content='T', tokenize="trigram remove_diacritics 1");
    
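  3. Input Data: Not a factor. The sample text contains diacritics (e.g., "aàbcdeéfghij KLMNOPQRST uvwxyz"), but the error is raised by the CREATE VIRTUAL TABLE statement itself, before any text is indexed.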

The error manifests during the parsing of the tokenize parameter string. The tokenizer constructor fails to recognize the remove_diacritics option, leading to an immediate termination of the virtual table creation process.
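The scenario above can be reproduced from Python with a short sketch like the following. It assumes the sqlite3 module is linked against a SQLite build with FTS5 enabled; the helper name try_create is hypothetical. The result depends on the library version in use, so no particular output is claimed here.

```python
import sqlite3


def try_create(tokenize: str) -> tuple[bool, str]:
    """Try to create an FTS5 table with the given tokenizer spec.

    Returns (True, "ok") on success, or (False, error message) on failure.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE T (a TEXT)")
        conn.execute(
            "CREATE VIRTUAL TABLE fts5_T USING fts5("
            f"a, content='T', tokenize=\"{tokenize}\")"
        )
        return True, "ok"
    except sqlite3.OperationalError as exc:
        # On affected versions the message is expected to mention the
        # tokenizer constructor.
        return False, str(exc)
    finally:
        conn.close()


print(sqlite3.sqlite_version, try_create("trigram remove_diacritics 1"))
```

Running the same helper with plain "trigram" is a quick way to confirm that the failure comes from the option rather than from the tokenizer itself.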

Underlying Technical Failure

The trigram tokenizer in SQLite versions before 3.45 does not include support for the remove_diacritics option. The tokenizer initialization routine expects a predefined set of parameters, and any unrecognized option (like remove_diacritics in older versions) results in a constructor error. This is a version-specific limitation tied to the implementation timeline of FTS5 features.

Possible Causes: Version Mismatch, Syntax Misconfiguration, and Build Options

Cause 1: SQLite Version Incompatibility with remove_diacritics

The remove_diacritics option for the trigram tokenizer was not available in SQLite 3.44.2 or earlier releases. This feature was introduced in subsequent development builds and is scheduled for official release in SQLite 3.45. The 3.44.x series is a patch branch that includes only critical bug fixes and no new features. Attempting to use remove_diacritics in 3.44.2 will always fail because the code to parse this option is absent from the tokenizer module.

Cause 2: Incorrect Tokenizer Parameter Syntax

While the primary cause is version incompatibility, syntax errors in the tokenize parameter string can also trigger similar runtime errors. The trigram tokenizer expects parameters in the format:

tokenize="trigram [option1] [option2] ..."

where valid options include case_sensitive and remove_diacritics, each followed by an integer (0 or 1) to enable/disable the feature. A misplaced space, missing integer argument, or misspelled option name (e.g., "remove_diacritic" instead of "remove_diacritics") could cause the tokenizer constructor to fail.
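One way to avoid typos in the option string is to build it programmatically. The sketch below is a hypothetical helper (build_trigram_tokenize is not part of SQLite or Python) that accepts only the two option names listed above and only the values 0 and 1.

```python
VALID_OPTIONS = {"case_sensitive", "remove_diacritics"}


def build_trigram_tokenize(**options: int) -> str:
    """Build a tokenize string such as 'trigram remove_diacritics 1'."""
    parts = ["trigram"]
    for name, value in options.items():
        if name not in VALID_OPTIONS:
            raise ValueError(f"unknown trigram option: {name!r}")
        if value not in (0, 1):
            raise ValueError(f"{name} must be 0 or 1, got {value!r}")
        # int() keeps True/False from being rendered as 'True'/'False'.
        parts += [name, str(int(value))]
    return " ".join(parts)


print(build_trigram_tokenize(remove_diacritics=1, case_sensitive=0))
# trigram remove_diacritics 1 case_sensitive 0
```

A misspelled name such as remove_diacritic then fails in the application with a clear ValueError instead of surfacing later as a tokenizer constructor error.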

Cause 3: FTS5 Extension Not Built with Trigram Support

Although the error message explicitly references the tokenizer constructor (indicating that FTS5 is enabled and the trigram tokenizer was found), it is worth ruling out a build without trigram support. The trigram tokenizer is built into FTS5 itself, which is enabled with the SQLITE_ENABLE_FTS5 compile-time flag; it has no separate flag of its own, but it was only added in SQLite 3.34.0. On an older library, or a custom build that omits it, tokenize="trigram" would fail with a different message, "no such tokenizer: trigram", rather than a constructor error. This is unlikely in current standard builds or precompiled binaries.
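Both conditions can be probed at runtime. The following is a sketch under the assumption that the library was not built with compile-option diagnostics omitted; the function names are hypothetical. It uses the built-in SQL function sqlite_compileoption_used() and a throwaway in-memory table.

```python
import sqlite3


def fts5_compiled_in() -> bool:
    """Report whether the linked SQLite was compiled with SQLITE_ENABLE_FTS5."""
    conn = sqlite3.connect(":memory:")
    try:
        row = conn.execute(
            "SELECT sqlite_compileoption_used('ENABLE_FTS5')"
        ).fetchone()
        return bool(row[0])
    finally:
        conn.close()


def trigram_available() -> bool:
    """Probe for the trigram tokenizer by creating a throwaway FTS5 table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE probe USING fts5(a, tokenize='trigram')")
        return True
    except sqlite3.OperationalError:
        # Raised when FTS5 is missing or the tokenizer is unknown.
        return False
    finally:
        conn.close()


print("FTS5 compiled in:", fts5_compiled_in())
print("trigram tokenizer available:", trigram_available())
```

The probe is the more reliable of the two checks, since FTS5 can also be provided as a loadable extension that the compile options would not list.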

Troubleshooting Steps, Solutions & Fixes: Version Upgrades, Workarounds, and Custom Builds

Step 1: Verify SQLite Version and Feature Availability

Action: Confirm the SQLite version and whether it includes the remove_diacritics option for the trigram tokenizer.
Command:

SELECT sqlite_version();

Expected Output: A version string of 3.45.0 or higher. For versions ≤3.44.2, the remove_diacritics option is unavailable.
Resolution: Upgrade to SQLite 3.45 or later once released. Pre-release builds containing the feature can be compiled from the SQLite source code repository.
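In application code, the same check can gate the choice between the built-in option and a workaround. This is a minimal sketch: the function name is hypothetical, and the 3.45.0 threshold is taken from this article.

```python
import sqlite3

# First release expected to accept remove_diacritics on the trigram tokenizer.
MIN_VERSION = (3, 45, 0)


def supports_trigram_remove_diacritics(version_info=sqlite3.sqlite_version_info) -> bool:
    """Compare a (major, minor, patch) tuple against the minimum version."""
    return tuple(version_info) >= MIN_VERSION


print(sqlite3.sqlite_version, supports_trigram_remove_diacritics())
```

Note that sqlite3.sqlite_version_info reports the library Python is linked against, which may differ from the version of the sqlite3 command-line shell installed on the same machine.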

Step 2: Validate Tokenizer Parameter Syntax

Action: Ensure the tokenize parameter string follows the exact syntax required by the trigram tokenizer.
Example of Correct Syntax:

CREATE VIRTUAL TABLE fts5_T USING fts5(
  a, 
  content='T', 
  tokenize="trigram remove_diacritics 1 case_sensitive 0"
);

Key Points:

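  • Options are space-separated.
  • Each option (remove_diacritics, case_sensitive) is followed by 0 (disable) or 1 (enable).
  • The order of options does not matter.
  • Per the FTS5 documentation, remove_diacritics 1 is only accepted together with case_sensitive 0 (the default); enabling both is an error.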

Step 3: Compile SQLite from Source with Latest FTS5 Patches

Action: Build SQLite from source, integrating the commit that implements remove_diacritics for the trigram tokenizer.
Required Commit: 0d50172477064dce (implements remove_diacritics).

Compilation Steps:

  1. Download the SQLite source code:
    fossil clone https://www.sqlite.org/cgi/src sqlite.fossil
    mkdir sqlite
    cd sqlite
    fossil open ../sqlite.fossil
    
  2. Update to the specific commit:
    fossil update 0d50172477
    
  3. Configure and build with FTS5 enabled:
    ./configure --enable-fts5
    make
    
  4. Replace the system SQLite binary with the newly built version or use it explicitly in your application.

Step 4: Implement Manual Diacritic Removal as a Workaround

Action: Preprocess text to remove diacritics before inserting it into the FTS5 table. This can be achieved via SQL functions or application-layer logic.

Example Using SQLite User-Defined Function (UDF):

  1. Register a UDF to Remove Diacritics (e.g., in Python using sqlite3):
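    import unicodedata
    import sqlite3
    
    def remove_diacritics(text):
        # Pass NULL and empty strings through unchanged.
        if not text:
            return text
        # NFKD splits each accented character into a base character plus
        # combining marks; dropping the marks leaves the base character.
        decomposed = unicodedata.normalize('NFKD', text)
        return ''.join(c for c in decomposed if not unicodedata.combining(c))
    
    conn = sqlite3.connect('base_test.db')
    # The function must be registered on every connection that uses it.
    conn.create_function('remove_diacritics', 1, remove_diacritics)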
    
  2. Modify the FTS5 Table Creation and Data Insertion:
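    -- Create a regular table with preprocessed text
    CREATE TABLE T (a TEXT);
    INSERT INTO T VALUES(remove_diacritics('aàbcdeéfghij KLMNOPQRST uvwxyz'));
    
    -- Create FTS5 table without remove_diacritics option
    CREATE VIRTUAL TABLE fts5_T USING fts5(a, content='T', tokenize="trigram");
    -- External-content tables are not populated automatically; rebuild the index
    INSERT INTO fts5_T(fts5_T) VALUES('rebuild');
    
    -- Strip diacritics from search terms too, so they match the stored text
    SELECT a FROM fts5_T WHERE fts5_T MATCH remove_diacritics('deéf');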
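Put together, the workaround might look like the following end-to-end sketch. It assumes a SQLite build with FTS5 and the trigram tokenizer, uses an in-memory database instead of base_test.db, and introduces a hypothetical search helper that applies the same stripping to the query term.

```python
import sqlite3
import unicodedata


def remove_diacritics(text):
    """Strip combining marks after NFKD decomposition; NULL passes through."""
    if text is None:
        return None
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def search(conn, term):
    """Search fts5_T for a substring, ignoring diacritics in the term."""
    # Quote the term as an FTS5 string so punctuation is not parsed as syntax.
    phrase = '"' + remove_diacritics(term).replace('"', '""') + '"'
    rows = conn.execute("SELECT a FROM fts5_T WHERE fts5_T MATCH ?", (phrase,))
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.create_function("remove_diacritics", 1, remove_diacritics)
conn.executescript("""
    CREATE TABLE T (a TEXT);
    INSERT INTO T VALUES (remove_diacritics('aàbcdeéfghij KLMNOPQRST uvwxyz'));
    CREATE VIRTUAL TABLE fts5_T USING fts5(a, content='T', tokenize="trigram");
    -- The external-content index starts empty; build it from T.
    INSERT INTO fts5_T(fts5_T) VALUES ('rebuild');
""")

print(search(conn, "DEÉF"))  # ['aabcdeefghij KLMNOPQRST uvwxyz']
```

The stored and returned text has lost its diacritics, which is the main trade-off of this approach compared with the built-in option.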
    

Step 5: Utilize Triggers for Automatic Diacritic Removal

Action: Create database triggers to automatically process text before insertion into the underlying content table.

Trigger Definition:

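-- Create a table that receives the raw, unmodified text (optional)
CREATE TABLE T_raw (a TEXT);

-- On each insert into T_raw, store the stripped text in T and add the new
-- row to the FTS5 index (external-content tables are not updated
-- automatically). remove_diacritics() is the UDF from Step 4 and must be
-- registered on every connection that writes to T_raw.
CREATE TRIGGER T_preprocess BEFORE INSERT ON T_raw
BEGIN
  INSERT INTO T(a) VALUES(remove_diacritics(NEW.a));
  INSERT INTO fts5_T(rowid, a)
    SELECT rowid, a FROM T WHERE rowid = last_insert_rowid();
END;

-- Insert data into the raw table
INSERT INTO T_raw VALUES('aàbcdeéfghij KLMNOPQRST uvwxyz');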
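As a variant of the trigger approach, the sketch below shares rowids between T_raw and T so that a match can be joined back to the original, accented text. It is an illustration under stated assumptions: T starts empty, the SQLite build includes FTS5 with the trigram tokenizer, and an AFTER INSERT trigger is used so that NEW.rowid is already assigned.

```python
import sqlite3
import unicodedata

RAW = "aàbcdeéfghij KLMNOPQRST uvwxyz"


def remove_diacritics(text):
    """Strip combining marks after NFKD decomposition; NULL passes through."""
    if text is None:
        return None
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


conn = sqlite3.connect(":memory:")
conn.create_function("remove_diacritics", 1, remove_diacritics)
conn.executescript("""
    CREATE TABLE T_raw (a TEXT);
    CREATE TABLE T (a TEXT);
    CREATE VIRTUAL TABLE fts5_T USING fts5(a, content='T', tokenize="trigram");

    -- Mirror each raw row into T and the index under the same rowid.
    CREATE TRIGGER T_preprocess AFTER INSERT ON T_raw
    BEGIN
      INSERT INTO T(rowid, a) VALUES (NEW.rowid, remove_diacritics(NEW.a));
      INSERT INTO fts5_T(rowid, a) VALUES (NEW.rowid, remove_diacritics(NEW.a));
    END;
""")
conn.execute("INSERT INTO T_raw VALUES (?)", (RAW,))

# Search the stripped index, but return the original text from T_raw.
rows = conn.execute(
    "SELECT T_raw.a FROM fts5_T JOIN T_raw ON T_raw.rowid = fts5_T.rowid "
    "WHERE fts5_T MATCH ?",
    ('"deef"',),
).fetchall()
print(rows)
```

A complete setup would also need UPDATE and DELETE triggers to keep T and the index consistent with T_raw; those are omitted here for brevity.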

Step 6: Monitor SQLite Release Channels for 3.45 Availability

Action: Track SQLite’s official release announcements to upgrade promptly when version 3.45 becomes available.

Final Recommendation

For production systems requiring diacritic-insensitive trigram searches, the optimal solution is to wait for SQLite 3.45 and use the built-in remove_diacritics option. Until then, manual preprocessing via UDFs or triggers provides a functional workaround. Developers testing pre-release builds should ensure thorough validation in non-production environments to avoid stability issues.
