FTS5 Diacritic Filtering Issue with Ø and Unicode Decomposition
Issue Overview: FTS5 Fails to Match "O" with "Ø" Despite remove_diacritics=2 Configuration
The problem involves SQLite’s FTS5 full-text search engine failing to match the character "O" against "Ø" (Latin capital letter O with stroke) when using the unicode61 tokenizer with remove_diacritics=2. The user’s configuration assumes that all diacritic-modified versions of "O" will be normalized to the base character "O" during tokenization. While this works for characters like "Ô" (O with circumflex), it fails for "Ø". The discrepancy arises because Unicode’s decomposition rules do not treat "Ø" as a diacritic-modified variant of "O" but rather as a distinct character with no canonical decomposition defined. This creates an inconsistency in search behavior between characters with decompositions (e.g., "Ô") and those without (e.g., "Ø").
The FTS5 virtual table is configured to strip diacritics by normalizing Unicode characters to their base forms. However, the Unicode standard defines decompositions only for some characters, either "canonical decompositions" or "compatibility decompositions", both recorded in the Unicode Character Database. Characters like "Ø" have neither, meaning they are not algorithmically mapped to "O" during normalization. This results in the FTS5 index treating "Ø" as a separate token, requiring an exact match in queries. The core challenge lies in reconciling user expectations of diacritic-agnostic search with the limitations of Unicode normalization and SQLite’s implementation.
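The behavior is easy to reproduce in a few lines of Python (a minimal sketch, assuming an SQLite build with FTS5 enabled; the demo table and column names are hypothetical):
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE VIRTUAL TABLE demo USING fts5("
    "body, tokenize = 'unicode61 remove_diacritics 2')"
)
db.executemany("INSERT INTO demo(body) VALUES (?)", [("Ôz",), ("Øz",)])
# 'Ôz' is found because U+00D4 decomposes to O + combining circumflex;
# 'Øz' is not, because U+00D8 has no decomposition to strip.
print(db.execute("SELECT body FROM demo WHERE demo MATCH 'Oz'").fetchall())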
Possible Causes: Unicode Decomposition Gaps and Tokenizer Configuration
1. Unicode Normalization Rules Exclude "Ø" from Decomposition
The Unicode standard assigns "Ø" (U+00D8) the status of a distinct letter in Scandinavian alphabets rather than a diacritic-modified "O". Its decomposition field in the Unicode Character Database (UCD) is empty, unlike "Ô" (U+00D4), which decomposes to "O" + " ̂" (combining circumflex). The unicode61 tokenizer relies on these decomposition rules to remove diacritics. Since "Ø" lacks a decomposition, it remains unaltered during tokenization, even with remove_diacritics=2.
2. Misunderstanding remove_diacritics=2 Scope
The remove_diacritics option in FTS5’s unicode61 tokenizer has three modes:
- 0: Disable diacritic removal.
- 1: Remove diacritics from Latin script characters, except for codepoints that carry more than one diacritic in a single character, which are left untouched for backwards-compatibility reasons (default).
- 2: Remove diacritics from all Latin script characters, including the multi-diacritic codepoints missed by mode 1.
While remove_diacritics=2 covers these edge cases, it does not invent decompositions for characters where Unicode does not define them. Users might incorrectly assume this option forces heuristic diacritic removal, but it strictly follows Unicode decomposition rules.
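The difference between modes 1 and 2 can be observed directly (a sketch assuming an FTS5-enabled SQLite build; per the SQLite documentation, U+1ED9 carries two diacritics and is the kind of codepoint mode 1 misses, so the expected outcome is hedged in the comment):
import sqlite3

word = "\u1ED9z"   # 'ộz': LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW + 'z'

for mode in (1, 2):
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE VIRTUAL TABLE t USING fts5("
        f"body, tokenize = 'unicode61 remove_diacritics {mode}')"
    )
    db.execute("INSERT INTO t(body) VALUES (?)", (word,))
    hits = db.execute("SELECT body FROM t WHERE t MATCH 'oz'").fetchall()
    print(mode, hits)   # expected per the docs: no hit with mode 1, one hit with mode 2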
3. FTS5 Tokenization Workflow Limitations
FTS5’s unicode61 tokenization occurs in stages (note that the tokenizer does not apply full Unicode normalization such as NFC/NFD to its input):
- Diacritic Removal: Apply Unicode decomposition data and strip combining characters, as configured by remove_diacritics.
- Case Folding: Convert to lowercase.
- Separator Splitting: Break tokens at whitespace/punctuation.
The failure to map "Ø" to "O" occurs at the diacritic removal stage. Since "Ø" does not decompose, it bypasses this step entirely. This contrasts with "Ô", which decomposes into "O" (U+004F) and " ̂" (U+0302), allowing the diacritic (U+0302) to be stripped.
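One way to see this is to inspect the terms that actually land in the index using an fts5vocab table (a minimal sketch assuming an FTS5-enabled SQLite build; the table names are hypothetical):
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE VIRTUAL TABLE t USING fts5("
    "body, tokenize = 'unicode61 remove_diacritics 2')"
)
db.execute("CREATE VIRTUAL TABLE t_vocab USING fts5vocab('t', 'row')")
db.execute("INSERT INTO t(body) VALUES ('Ôz Øz')")
# Expected terms: 'oz' (from Ôz, diacritic stripped and case folded)
# and 'øz' (Ø only case folded, never mapped to 'o')
print(db.execute("SELECT term FROM t_vocab ORDER BY term").fetchall())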
Troubleshooting Steps, Solutions & Fixes
1. Verify Unicode Decomposition Status
Use official Unicode resources to confirm whether a character has a decomposition:
- Unicode Character Database (UCD): Query the "Decomposition_Mapping" field for "Ø" (U+00D8). This returns an empty string, indicating no decomposition.
- Unicode Utilities: Tools like unicodedata in Python:
import unicodedata
print(unicodedata.decomposition('Ø'))  # Returns '' (empty string)
print(unicodedata.decomposition('Ô'))  # Returns '004F 0302'
This confirms that the issue stems from Unicode’s design, not SQLite.
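To get an overview of which other letters behave like "Ø", a short scan of the Latin-1 Supplement block (pure Python, no SQLite involved) lists every letter with no decomposition defined:
import unicodedata

# Letters in the Latin-1 Supplement block with no decomposition; these survive
# remove_diacritics unchanged (e.g. Æ, Ð, Ø, Þ, ß and their lowercase forms)
for cp in range(0x00C0, 0x0100):
    ch = chr(cp)
    if unicodedata.category(ch).startswith("L") and not unicodedata.decomposition(ch):
        print(f"U+{cp:04X} {ch}")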
2. Custom Tokenizer with Override Rules
SQLite allows custom tokenizers via its C API. To handle "Ø" → "O" mapping:
- Modify the Tokenizer Logic: Override the default unicode61 behavior by adding custom decomposition rules. Example using a hypothetical custom_tokenizer:
CREATE VIRTUAL TABLE ftsTest USING fts5(
  DESCRIPTION,
  TEXT_TO_MATCH,
  content=test,
  tokenize = 'custom_tokenizer remove_diacritics 2'
);
- Implement a Mapping Table: Hardcode substitutions for characters like "Ø", "ø", "Ł", "ł", etc., in the tokenizer’s logic.
3. Preprocess Text Before Indexing
Normalize text at insertion time using a deterministic mapping:
- SQL Trigger-Based Replacement:
CREATE TRIGGER preprocess_text AFTER INSERT ON test
BEGIN
  UPDATE test
  SET TEXT_TO_MATCH = REPLACE(NEW.TEXT_TO_MATCH, 'Ø', 'O')
  WHERE rowid = NEW.rowid;
END;
- Application-Level Normalization: Replace "Ø" with "O" in application code before inserting into the test table.
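A minimal sketch of the application-level variant in Python (the substitution table and helper name are hypothetical, and the test table schema is assumed from the earlier examples):
import sqlite3

# Hypothetical substitution table for letters that unicode61 will never decompose
STROKE_MAP = str.maketrans({"Ø": "O", "ø": "o", "Ł": "L", "ł": "l", "Đ": "D", "đ": "d"})

def insert_row(db: sqlite3.Connection, description: str, text: str) -> None:
    # Store the folded form so the FTS5 index sees 'Oz' instead of 'Øz'
    db.execute(
        "INSERT INTO test(DESCRIPTION, TEXT_TO_MATCH) VALUES (?, ?)",
        (description, text.translate(STROKE_MAP)),
    )
If the original spelling must be preserved for display, one option is to keep it in a separate column and index only the folded copy.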
4. Leverage Spellfix1 for Fuzzy Matching
SQLite’s spellfix1 extension supports fuzzy matching of possibly misspelled words:
- Create a Virtual Table:
CREATE VIRTUAL TABLE fts_fuzzy USING spellfix1;
INSERT INTO fts_fuzzy(word) SELECT TEXT_TO_MATCH FROM test;
- Query with Top-N Suggestions:
SELECT * FROM fts_fuzzy WHERE word MATCH 'oz' AND TOP=3;
This bypasses FTS5’s tokenization limitations but requires additional storage.
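A rough sketch of driving this from Python (assuming a separately compiled spellfix1 loadable extension and a Python build that permits extension loading; the file path is illustrative and varies by platform):
import sqlite3

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
db.load_extension("./spellfix")   # path to the compiled spellfix1 extension

db.execute("CREATE VIRTUAL TABLE fts_fuzzy USING spellfix1")
db.execute("INSERT INTO fts_fuzzy(word) VALUES ('Øz')")
# spellfix1 transliterates and ranks by edit distance, so 'oz' can surface 'Øz'
print(db.execute(
    "SELECT word, distance FROM fts_fuzzy WHERE word MATCH 'oz' AND top = 3"
).fetchall())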
5. Combine FTS5 with LIKE/Regex Filters
Use FTS5 for the normal match and union in a LIKE fallback for the characters the tokenizer cannot fold:
SELECT * FROM ftsTest WHERE ftsTest MATCH 'Oz'
UNION
SELECT * FROM ftsTest WHERE TEXT_TO_MATCH LIKE '%Øz%';
The LIKE branch ensures rows containing "Øz" are included, but its full scan may impact performance on large datasets.
6. Advocate for Unicode61 Extension
Submit a feature request to the SQLite team to support user-defined decomposition rules in unicode61. This would allow:
tokenize = "unicode61 remove_diacritics 2 custom_mappings='Ø→O, ø→o'"
7. Documentation and User Education
Clarify in project documentation that remove_diacritics=2 adheres strictly to Unicode decomposition rules. Provide a list of affected characters (e.g., "Ø", "Ł", "Đ") and recommend preprocessing steps.
8. Test Alternative Normalization Forms
While SQLite’s unicode61 tokenizer does not apply full Unicode normalization itself, experiment with NFKC/NFKD forms externally:
# Python example using NFKD
import unicodedata
normalized_text = unicodedata.normalize('NFKD', 'Øz').encode('ascii', 'ignore')
Since NFKD does not decompose "Ø", the ASCII encode step simply drops the character (yielding b'z'); an explicit substitution table is still needed for such letters.
9. Cross-Database Comparison
Compare with other databases’ behavior:
- PostgreSQL: CREATE INDEX ... USING gin (to_tsvector('simple', TEXT_TO_MATCH)) combined with the unaccent extension.
- MySQL: ALTER TABLE test ADD FULLTEXT INDEX (TEXT_TO_MATCH) WITH PARSER ngram;
Highlighting these differences helps users set realistic expectations for SQLite’s FTS5.
10. Fallback to Collation-Based Comparison
Use a custom collation to treat "O" and "Ø" as equivalent:
SELECT * FROM test
WHERE TEXT_TO_MATCH = 'Oz' COLLATE O_AND_Ø_EQUAL;
Requires implementing a collation via sqlite3_create_collation() (or the equivalent wrapper in your language’s SQLite binding).
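A minimal sketch using Python’s sqlite3 binding, whose Connection.create_collation wraps sqlite3_create_collation(); the collation name and folding table are hypothetical, and an ASCII name is used for convenience:
import sqlite3

FOLD = str.maketrans({"Ø": "O", "ø": "o"})

def o_and_stroke_equal(a: str, b: str) -> int:
    # Compare after folding Ø/ø to O/o and ignoring case
    a, b = a.translate(FOLD).lower(), b.translate(FOLD).lower()
    return (a > b) - (a < b)

db = sqlite3.connect(":memory:")
db.create_collation("O_AND_STROKE_EQUAL", o_and_stroke_equal)
db.execute("CREATE TABLE test (TEXT_TO_MATCH TEXT)")
db.execute("INSERT INTO test VALUES ('Øz')")
print(db.execute(
    "SELECT * FROM test WHERE TEXT_TO_MATCH = 'Oz' COLLATE O_AND_STROKE_EQUAL"
).fetchall())   # [('Øz',)]
Note that a collation only affects comparisons and ordering; it does not change how FTS5 tokenizes or indexes the text.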
By systematically addressing Unicode’s limitations, SQLite’s tokenization workflow, and application-level workarounds, users can achieve diacritic-agnostic search for characters like "Ø" without relying solely on FTS5’s default behavior.