FTS5 Diacritic Filtering Issue with Ø and Unicode Decomposition

Issue Overview: FTS5 Fails to Match "O" with "Ø" Despite remove_diacritics=2 Configuration

The problem involves SQLite’s FTS5 full-text search engine failing to match the character "O" against "Ø" (Latin capital letter O with stroke) when using the unicode61 tokenizer with remove_diacritics=2. The user’s configuration assumes that all diacritic-modified versions of "O" will be normalized to the base character "O" during tokenization. While this works for characters like "Ô" (O with circumflex), it fails for "Ø". The discrepancy arises because Unicode’s decomposition rules do not treat "Ø" as a diacritic-modified variant of "O" but rather as a distinct character with no canonical decomposition defined. This creates an inconsistency in search behavior between characters with decompositions (e.g., "Ô") and those without (e.g., "Ø").

The FTS5 virtual table is configured to strip diacritics by normalizing Unicode characters to their base forms. However, the Unicode Character Database assigns some characters a canonical decomposition and others a compatibility decomposition; "Ø" has neither, so it is never algorithmically mapped to "O" during normalization. As a result, the FTS5 index treats "Ø" as a distinct token that only an exact query can match. The core challenge lies in reconciling user expectations of diacritic-agnostic search with the limitations of Unicode normalization and SQLite’s implementation.

Possible Causes: Unicode Decomposition Gaps and Tokenizer Configuration

1. Unicode Normalization Rules Exclude "Ø" from Decomposition

The Unicode standard assigns "Ø" (U+00D8) the status of a distinct letter in Scandinavian alphabets rather than a diacritic-modified "O". Its decomposition field in the Unicode Character Database (UCD) is empty, unlike "Ô" (U+00D4), which decomposes to "O" (U+004F) followed by the combining circumflex (U+0302). The unicode61 tokenizer relies on these decomposition rules to remove diacritics. Since "Ø" lacks a decomposition, it remains unaltered during tokenization, even with remove_diacritics=2.

2. Misunderstanding remove_diacritics=2 Scope

The remove_diacritics option in FTS5’s unicode61 tokenizer has three modes:

  • 0: Do not remove diacritics.
  • 1: Remove diacritics from Latin script characters (the default), except where the diacritic is encoded as a separate combining codepoint or a single codepoint carries more than one diacritic.
  • 2: Remove diacritics from all Latin script characters, including those combining-codepoint cases.

While remove_diacritics=2 extends diacritic stripping to those edge cases, it does not invent mappings for characters that Unicode does not treat as diacritic-modified. Users might assume this option forces heuristic diacritic removal, but it strictly follows the Unicode-derived mapping tables, which contain no entry for "Ø".
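The boundary is easy to observe by indexing both "Ôz" and "Øz" and querying for "Oz". A minimal Python sketch, assuming the sqlite3 module was built with FTS5 enabled (the table and column names are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE demo USING fts5("
    "txt, tokenize = 'unicode61 remove_diacritics 2')"
)
conn.executemany("INSERT INTO demo(txt) VALUES (?)", [("Ôz",), ("Øz",)])

# Matches the row containing "Ôz" (circumflex stripped), but not "Øz",
# because U+00D8 has no decomposition for the tokenizer to apply.
print(conn.execute("SELECT txt FROM demo WHERE demo MATCH 'Oz'").fetchall())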

3. FTS5 Tokenization Workflow Limitations

FTS5’s unicode61 tokenizer processes text in stages, roughly:

  1. Separator Splitting: Break the input into tokens at whitespace, punctuation, and other separator codepoints.
  2. Case Folding: Convert token characters to lowercase.
  3. Diacritic Removal: When remove_diacritics is enabled, map diacritic-modified characters to their base forms and strip combining marks, using tables derived from the Unicode Character Database.

The failure to map "Ø" to "O" occurs at the diacritic-removal stage. Because "Ø" has no decomposition, the mapping tables contain no entry for it and it passes through unchanged. This contrasts with "Ô", which corresponds to "O" (U+004F) plus the combining circumflex (U+0302), so the diacritic can be stripped.
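The difference at the diacritic-removal stage can be approximated outside SQLite with Python’s unicodedata; this is a sketch of the idea, not unicode61’s actual implementation:

import unicodedata

def strip_diacritics(text):
    # Decompose (NFD), then drop combining marks (category Mn).
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

print(strip_diacritics('Ôz'))  # 'Oz' -- the circumflex splits off and is removed
print(strip_diacritics('Øz'))  # 'Øz' -- no decomposition, the stroke survives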

Troubleshooting Steps, Solutions & Fixes

1. Verify Unicode Decomposition Status

Use official Unicode resources to confirm whether a character has a decomposition:

  • Unicode Character Database (UCD): Query the "Decomposition_Mapping" field for "Ø" (U+00D8). This returns an empty string, indicating no decomposition.
  • Unicode Utilities: Tools like unicodedata in Python:
    import unicodedata
    print(unicodedata.decomposition('Ø'))  # Returns empty string
    print(unicodedata.decomposition('Ô'))  # Returns '004F 0302'
    

This confirms that the issue stems from Unicode’s design, not SQLite.

2. Custom Tokenizer with Override Rules

SQLite allows custom tokenizers via its C API. To handle "Ø" → "O" mapping:

  • Modify the Tokenizer Logic: Override the default unicode61 behavior by adding custom decomposition rules. Example using a hypothetical custom_tokenizer:
    CREATE VIRTUAL TABLE ftsTest USING fts5(
      DESCRIPTION, 
      TEXT_TO_MATCH, 
      content=test, 
      tokenize = 'custom_tokenizer remove_diacritics 2'
    );
    
  • Implement a Mapping Table: Hardcode substitutions for characters like "Ø", "ø", "Ł", "ł", etc., in the tokenizer’s logic.

3. Preprocess Text Before Indexing

Normalize text at insertion time using a deterministic mapping:

  • SQL Trigger-Based Replacement (a trigger cannot rewrite NEW directly in SQLite, so normalize the freshly inserted row instead):
    CREATE TRIGGER preprocess_text AFTER INSERT ON test
    BEGIN
      UPDATE test
      SET TEXT_TO_MATCH = REPLACE(NEW.TEXT_TO_MATCH, 'Ø', 'O')
      WHERE rowid = NEW.rowid;
    END;
    
  • Application-Level Normalization: Replace "Ø" with "O" in application code before inserting into the test table, as in the sketch below.
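A minimal application-level sketch, assuming the test table from the question and a hypothetical fold() helper that combines NFD-based stripping with explicit substitutions for the stroked letters Unicode never decomposes:

import sqlite3
import unicodedata

STROKED = str.maketrans({'Ø': 'O', 'ø': 'o', 'Ł': 'L', 'ł': 'l', 'Đ': 'D', 'đ': 'd'})

def fold(text):
    # Strip ordinary diacritics via decomposition, then map the leftovers by hand.
    nfd = unicodedata.normalize('NFD', text)
    stripped = ''.join(ch for ch in nfd if unicodedata.category(ch) != 'Mn')
    return stripped.translate(STROKED)

conn = sqlite3.connect('example.db')  # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS test (DESCRIPTION TEXT, TEXT_TO_MATCH TEXT)')
conn.execute('INSERT INTO test VALUES (?, ?)', ('sample row', fold('Øzone layer')))
conn.commit()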

4. Leverage Spellfix1 for Fuzzy Matching

SQLite’s spellfix1 extension, shipped as a run-time loadable extension, supports fuzzy, edit-distance-based matching:

  • Create a Virtual Table:
    CREATE VIRTUAL TABLE fts_fuzzy USING spellfix1;
    INSERT INTO fts_fuzzy(word) SELECT TEXT_TO_MATCH FROM test;
    
  • Query with Top-N Suggestions:
    SELECT * FROM fts_fuzzy WHERE word MATCH 'oz' AND TOP=3;
    

This bypasses FTS5’s tokenization limitations but requires additional storage.
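A rough Python sketch of the same flow, assuming a compiled spellfix1 library is available on disk and the Python build permits extension loading (both are assumptions about the environment):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)
conn.load_extension("./spellfix")   # path to the compiled spellfix1 extension (assumption)
conn.execute("CREATE VIRTUAL TABLE fts_fuzzy USING spellfix1")
conn.execute("INSERT INTO fts_fuzzy(word) VALUES ('Øzone')")

# Edit-distance matching; whether "Øzone" is returned for 'ozone' depends on
# spellfix1's built-in transliteration table.
for word, distance in conn.execute(
        "SELECT word, distance FROM fts_fuzzy WHERE word MATCH 'ozone' AND top=3"):
    print(word, distance)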

5. Combine FTS5 with LIKE/Regex Filters

Use FTS5 for the diacritic-stripped form and union in a LIKE scan for the stroked spelling (combining MATCH and OR on an FTS5 table can trip the query planner, so UNION is safer):

SELECT * FROM ftsTest WHERE TEXT_TO_MATCH MATCH 'Oz'
UNION
SELECT * FROM ftsTest WHERE TEXT_TO_MATCH LIKE '%Øz%';

This ensures rows containing "Øz" are included, but the unindexed LIKE scan may hurt performance on large datasets.

6. Advocate for Unicode61 Extension

Submit a feature request to the SQLite team to support user-defined decomposition rules in unicode61. This would allow:

tokenize = "unicode61 remove_diacritics 2 custom_mappings='Ø→O, ø→o'"

7. Documentation and User Education

Clarify in project documentation that remove_diacritics=2 adheres strictly to Unicode decomposition rules. Provide a list of affected characters (e.g., "Ø", "Ł", "Đ") and recommend preprocessing steps.
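Such a list does not have to be maintained by hand; a rough unicodedata scan finds Latin letters whose names suggest a modifier but which have no decomposition (the codepoint range chosen here is an arbitrary cut-off):

import unicodedata

affected = []
for cp in range(0x00C0, 0x0250):   # Latin-1 Supplement through Latin Extended-B
    ch = chr(cp)
    name = unicodedata.name(ch, '')
    if 'LATIN' in name and ' WITH ' in name and not unicodedata.decomposition(ch):
        affected.append(ch)

print(''.join(affected))   # includes Ø, ø, Đ, đ, Ħ, ħ, Ł, ł, ...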

8. Test Alternative Normalization Forms

The unicode61 tokenizer does not expose a choice of Unicode normalization form, but NFKC/NFKD can be tested externally:

# Python example using NFKD
import unicodedata
normalized_text = unicodedata.normalize('NFKD', 'Øz').encode('ascii', 'ignore')
print(normalized_text)  # b'z' -- "Ø" is dropped rather than folded to "O"

NFKD does not decompose "Ø" either (its compatibility decomposition is also empty), so this alone does not solve the problem, though it remains a useful preprocessing step for the many characters that do decompose.

9. Cross-Database Comparison

Compare with other databases’ behavior:

  • PostgreSQL: CREATE INDEX ... USING gin (to_tsvector('simple', TEXT_TO_MATCH)) combined with the unaccent extension, whose rules file does map "Ø" to "O".
  • MySQL: ALTER TABLE test ADD FULLTEXT INDEX (TEXT_TO_MATCH) WITH PARSER ngram;

Highlighting these differences helps users set realistic expectations for SQLite’s FTS5.

10. Fallback to Collation-Based Sorting

Use a custom collation to treat "O" and "Ø" as equivalent:

SELECT * FROM test 
WHERE TEXT_TO_MATCH = 'Oz' COLLATE O_AND_Ø_EQUAL;

Requires implementing a collation via sqlite3_create_collation().
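From Python, sqlite3.Connection.create_collation() wraps the same C API; a minimal sketch in which the collation name and its equivalence table are illustrative:

import sqlite3

FOLD = str.maketrans({'Ø': 'O', 'ø': 'o'})

def o_and_o_stroke_equal(a, b):
    # Compare after folding the stroked forms; return -1, 0, or 1.
    a, b = a.translate(FOLD), b.translate(FOLD)
    return (a > b) - (a < b)

conn = sqlite3.connect(':memory:')
conn.create_collation('O_AND_O_STROKE_EQUAL', o_and_o_stroke_equal)
conn.execute('CREATE TABLE test (TEXT_TO_MATCH TEXT)')
conn.execute("INSERT INTO test VALUES ('Øz')")
print(conn.execute(
    "SELECT * FROM test WHERE TEXT_TO_MATCH = 'Oz' COLLATE O_AND_O_STROKE_EQUAL"
).fetchall())   # [('Øz',)]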

By systematically addressing Unicode’s limitations, SQLite’s tokenization workflow, and application-level workarounds, users can achieve diacritic-agnostic search for characters like "Ø" without relying solely on FTS5’s default behavior.
