FTS5 Greek Diacritic Insensitivity in Full-Text Search Queries
FTS5 Tokenization Mechanics and Greek Diacritic Matching Limitations
Issue Overview
The core problem is that SQLite’s FTS5 extension does not treat Greek characters with diacritics (e.g., ά, ϊ, ΰ) as equivalent to their base forms (e.g., α, ι, υ) during full-text searches. FTS5 tokenizes text according to the rules of its configured tokenizer, which for Unicode-aware tokenization is typically unicode61. While unicode61 supports stripping diacritics via the remove_diacritics option, the SQLite documentation describes this feature as applying to Latin script characters. Consequently, Greek diacritics are preserved during tokenization, so searches for "μέλι" (with acute accent) and "μελι" (without) return mutually exclusive results.
This behavior conflicts with linguistic expectations in Greek, where diacritics often denote stress or pronunciation but do not alter lexical identity. For example:
- “μέλι” (honey) and “μελι” (same word without accent) should match.
- “προϊόν” (product with diaeresis) and “προιόν” (without) should match.
The default FTS5 configuration treats these as distinct tokens due to Unicode codepoint differences. This creates a usability gap for applications requiring diacritic-insensitive searches in Greek.
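The codepoint difference is easy to see outside SQLite. The following Python sketch uses only built-ins; the helper name codepoints is made up for illustration. It lists the codepoints of the two spellings of "honey":

```python
def codepoints(text):
    """Return the Unicode codepoints of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

accented = "\u03bc\u03ad\u03bb\u03b9"  # μέλι, with precomposed έ (U+03AD)
plain = "\u03bc\u03b5\u03bb\u03b9"     # μελι, with plain ε (U+03B5)

print(codepoints(accented))  # ['U+03BC', 'U+03AD', 'U+03BB', 'U+03B9']
print(codepoints(plain))     # ['U+03BC', 'U+03B5', 'U+03BB', 'U+03B9']
print(accented == plain)     # False
```

Only the second codepoint differs, which is enough for a tokenizer without Greek diacritic handling to index the two spellings as unrelated terms.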
Tokenizer Constraints and Unicode Normalization Gaps
1. unicode61 Tokenizer’s Diacritic Removal Limitations
The unicode61 tokenizer’s remove_diacritics option operates on a fixed, built-in table of characters, which the SQLite documentation describes as covering Latin script. Greek letters are not part of that table. The tokenizer case-folds each character and strips diacritics only where the table has a mapping, so Greek characters pass through with their accents and breathing marks intact.
2. Absence of Built-In Greek-Specific Normalization
SQLite lacks built-in mechanisms for normalizing Greek text to a diacritic-insensitive form. Unicode normalization forms (NFD, NFC) decompose or recompose characters but do not remove diacritics. For example:
- “ά” (U+03AC) decomposes to “α” (U+03B1) + “́” (U+0301) in NFD.
- Normalization only changes how a character is encoded, so “ά” and “α” remain different codepoint sequences until the combining marks are explicitly dropped, either by the tokenizer or in preprocessing.
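The decomposition can be checked with Python’s standard unicodedata module. This is a small sketch to illustrate the point above, not part of any SQLite workflow:

```python
import unicodedata

accented = "\u03ac"  # ά as a single precomposed codepoint

nfd = unicodedata.normalize("NFD", accented)  # decomposed form
nfc = unicodedata.normalize("NFC", nfd)       # recomposed form

print([hex(ord(c)) for c in nfd])  # ['0x3b1', '0x301']
print(nfc == accented)             # True
print(nfd == "\u03b1")             # False
```

NFD exposes the accent as a separate combining mark and NFC folds it back in. Neither form equals the bare α, so a further step has to discard the mark.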
3. Tokenizer Customization Overhead
FTS5 allows custom tokenizers, but implementing one means writing C code against the FTS5 tokenizer API and registering it from the application or a loadable extension, or else patching SQLite’s own source. This is non-trivial for users unfamiliar with SQLite’s internals. Prebuilt extensions like the Snowball stemmer might help but may not address Greek diacritics directly.
Custom Tokenizer Implementation and Query Workarounds
Step 1: Validate Current Tokenizer Behavior
Confirm the default tokenizer’s handling of Greek characters:
-- Create a table with `unicode61` and `remove_diacritics` enabled for illustration:
CREATE VIRTUAL TABLE temp.greek_test USING fts5(
    word,
    tokenize="unicode61 remove_diacritics 2"
);
INSERT INTO greek_test VALUES ('μέλι μελι προϊόν προιόν');
-- Query for base and accented forms:
SELECT snippet(greek_test, 0, '<b>', '</b>', '…', 10)
FROM greek_test
WHERE greek_test MATCH 'μέλι';
-- The row matches, but only "μέλι" is highlighted, not "μελι", despite `remove_diacritics 2`
This confirms that Greek diacritics are not stripped.
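The same check can be scripted with one word per row, so that matching rows can be counted instead of reading snippet highlights. The sketch below assumes a Python build whose bundled SQLite includes FTS5 and is recent enough to accept remove_diacritics 2. The table name probe and the helper matches are made up for illustration. A Latin pair is included as a control to show what the option does when it applies:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE probe USING fts5("
    "word, tokenize='unicode61 remove_diacritics 2')"
)
words = [
    "caf\u00e9",                 # café (Latin, accented)
    "cafe",                      # cafe (Latin, plain)
    "\u03bc\u03ad\u03bb\u03b9",  # μέλι (Greek, accented)
    "\u03bc\u03b5\u03bb\u03b9",  # μελι (Greek, plain)
]
conn.executemany("INSERT INTO probe(word) VALUES (?)", [(w,) for w in words])

def matches(term):
    """Return the stored words whose tokens match the given search term."""
    rows = conn.execute(
        "SELECT word FROM probe WHERE probe MATCH ? ORDER BY rowid", (term,)
    )
    return [row[0] for row in rows]

latin_hits = matches("cafe")
greek_hits = matches("\u03bc\u03b5\u03bb\u03b9")
print("Latin control:", latin_hits)
print("Greek probe:  ", greek_hits)
```

If the Latin control returns both spellings while the Greek probe returns only the unaccented row, the limitation described above applies to the SQLite build in use.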
Step 2: Modify SQLite’s Diacritic Removal Logic
To extend diacritic removal to Greek, modify the diacritic-removal routine of the unicode61 tokenizer. In recent source trees this is the static function fts5_remove_diacritic in fts5_unicode2.c, which is also included in the sqlite3.c amalgamation; verify the name in your version. The built-in lookup tables use a packed encoding, so the pairs below are a simplified illustration of the mappings to add, not a drop-in patch.
Example Mappings for Greek Tonos and Dialytika:
/* Hypothetical table of (precomposed code point, base code point) pairs. */
static const unsigned short greek_diacritic_map[] = {
    /* Combining marks such as U+0300, U+0301 and U+0308 have no base
       letter of their own; drop them instead of mapping them to a letter. */
    /* Greek lowercase letters with diacritics */
    0x03AC, 0x03B1, /* ά → α */
    0x03AD, 0x03B5, /* έ → ε */
    0x03AE, 0x03B7, /* ή → η */
    0x03AF, 0x03B9, /* ί → ι */
    0x0390, 0x03B9, /* ΐ → ι */
    0x03B0, 0x03C5, /* ΰ → υ */
    0x03CC, 0x03BF, /* ό → ο */
    0x03CD, 0x03C5, /* ύ → υ */
    0x03CE, 0x03C9, /* ώ → ω */
    0x03CA, 0x03B9, /* ϊ → ι */
    0x03CB, 0x03C5, /* ϋ → υ */
};
Recompile SQLite with this modified code.
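Typing such pairs by hand is error-prone, so they can be derived from Unicode decomposition data instead. The following Python sketch does this for the Greek and Greek Extended blocks. The function name greek_diacritic_pairs and the output layout are illustrative assumptions, and the result still has to be adapted to whatever table format your SQLite version actually uses:

```python
import unicodedata

def greek_diacritic_pairs(ranges=((0x0370, 0x03FF), (0x1F00, 0x1FFF))):
    """Map each precomposed Greek letter to its base letter's codepoint."""
    pairs = {}
    for first, last in ranges:
        for cp in range(first, last + 1):
            decomposed = unicodedata.normalize("NFD", chr(cp))
            # Keep only the non-combining part of the decomposition.
            base = "".join(
                c for c in decomposed if not unicodedata.combining(c)
            )
            # Require a real decomposition that leaves a single letter.
            if len(decomposed) > 1 and len(base) == 1 and base.isalpha():
                pairs[cp] = ord(base)
    return pairs

pairs = greek_diacritic_pairs()
for cp in (0x03AC, 0x0390, 0x03CB):
    print(f"0x{cp:04X}, 0x{pairs[cp]:04X}, /* {chr(cp)} -> {chr(pairs[cp])} */")
```

Including the Greek Extended block also covers polytonic text (breathing marks, circumflex, iota subscript), which the hand-written list above does not.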
Step 3: Employ Custom Tokenizer with Extended Diacritic Removal
After recompiling, create an FTS5 table using the modified tokenizer:
CREATE VIRTUAL TABLE greek_word_fts USING fts5(
    word,
    number,
    tokenize="unicode61 remove_diacritics 2"
);
-- greek_word is assumed to be an existing table with (word, number) columns:
INSERT INTO greek_word_fts SELECT * FROM greek_word;
-- Test diacritic-insensitive search:
SELECT * FROM greek_word_fts WHERE greek_word_fts MATCH 'μέλι';
-- Returns both 'μέλι' (1) and 'μελι' (2)
Step 4: Preprocess Data with Unicode Normalization
If modifying SQLite is impractical, preprocess text to remove diacritics before insertion:
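- Use Python’s unicodedata module:
import unicodedata

def strip_diacritics(text):
    # Decompose to NFD, then drop every combining mark.
    return ''.join(
        c for c in unicodedata.normalize('NFD', text)
        if not unicodedata.combining(c)
    )
Apply this function to word values before inserting into FTS5. The statements below assume strip_diacritics has been registered as an application-defined SQL function (for example with sqlite3.Connection.create_function in Python). Otherwise, call it in application code and bind the result as a parameter:
INSERT INTO greek_word_fts(word, number)
VALUES (strip_diacritics('μέλι'), 1),
       (strip_diacritics('μελι'), 2);
Queries must also apply strip_diacritics() to search terms:
SELECT * FROM greek_word_fts
WHERE greek_word_fts MATCH strip_diacritics('μέλι');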
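For the SQL above to run, strip_diacritics has to exist as a SQL function on the connection. A minimal end-to-end sketch using Python’s standard sqlite3 module follows. It assumes the bundled SQLite was built with FTS5, and it uses an in-memory database and the default tokenizer purely for illustration:

```python
import sqlite3
import unicodedata

def strip_diacritics(text):
    """Decompose to NFD and drop every combining mark."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(c)
    )

conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the same name (one argument).
conn.create_function("strip_diacritics", 1, strip_diacritics)

conn.execute("CREATE VIRTUAL TABLE greek_word_fts USING fts5(word, number)")
conn.executemany(
    "INSERT INTO greek_word_fts(word, number) VALUES (strip_diacritics(?), ?)",
    [("\u03bc\u03ad\u03bb\u03b9", 1), ("\u03bc\u03b5\u03bb\u03b9", 2)],
)

# Search with the accented spelling; the term is stripped the same way.
rows = conn.execute(
    "SELECT number FROM greek_word_fts "
    "WHERE greek_word_fts MATCH strip_diacritics(?) ORDER BY rowid",
    ("\u03bc\u03ad\u03bb\u03b9",),
).fetchall()
print(rows)
```

One trade-off of this approach is that the table stores only the stripped text. If the original spelling must be displayed, it would need to be kept elsewhere, for example in a separate column or table.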
Step 5: Leverage the Snowball Stemmer Extension
The fts5-snowball extension provides stemming for multiple languages. If Greek is supported, install the extension and create a tokenizer:
-- Load the extension (path varies by OS):
.load './fts5_snowball'
CREATE VIRTUAL TABLE greek_word_fts USING fts5(
    word,
    number,
    tokenize="snowball greek"
);
Verify if stemming conflates diacritic variants. If not, combine with preprocessing.
Step 6: Debugging Custom Tokenizers
For developers troubleshooting custom tokenizers:
- Compile SQLite with Debug Symbols:
Ensure CFLAGS="-g" is set during compilation.
- Set Breakpoints in the Diacritic Removal Routine:
Break on the function name rather than a hard-coded line number, which shifts between releases. The name below is the one used in recent source trees; verify it in yours:
gdb ./sqlite3
(gdb) break fts5_remove_diacritic
(gdb) run < test_queries.sql
- Inspect Tokenization Outputs:
Use an fts5vocab table to list the terms actually stored in the index:
CREATE VIRTUAL TABLE greek_word_vocab USING fts5vocab('greek_word_fts', 'row');
SELECT term FROM greek_word_vocab;
-- Check if terms match expected base forms.
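Indexed terms can also be inspected from a script. SQLite provides the fts5vocab virtual table module, which exposes the terms stored in an FTS5 index. The sketch below assumes a Python build with FTS5 available and uses illustrative table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE greek_word_fts USING fts5("
    "word, tokenize='unicode61 remove_diacritics 2')"
)
conn.execute(
    "INSERT INTO greek_word_fts(word) VALUES (?)",
    ("\u03bc\u03ad\u03bb\u03b9 \u03bc\u03b5\u03bb\u03b9",),  # "μέλι μελι"
)

# A 'row' type fts5vocab table has one row per distinct term in the index.
conn.execute(
    "CREATE VIRTUAL TABLE greek_word_vocab "
    "USING fts5vocab('greek_word_fts', 'row')"
)
terms = [row[0] for row in conn.execute("SELECT term FROM greek_word_vocab")]
print(terms)
```

A single term in the output means the tokenizer folded both spellings together. Two terms mean the accented form is still indexed separately.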
By systematically addressing tokenizer limitations through code modification, preprocessing, or extensions, FTS5 can be coerced into diacritic-insensitive Greek search. Each approach balances development effort, maintainability, and linguistic accuracy.