FTS5 Greek Diacritic Insensitivity in Full-Text Search Queries

FTS5 Tokenization Mechanics and Greek Diacritic Matching Limitations

Issue Overview

The core issue is that SQLite's FTS5 extension does not treat Greek characters with diacritics (e.g., ά, ϊ, ΰ) as equivalent to their base forms (e.g., α, ι, υ) during full-text searches. By default, FTS5 tokenizes text with the unicode61 tokenizer. While unicode61 supports stripping diacritics via the remove_diacritics option, that feature is limited to Latin script characters. Consequently, Greek diacritics are preserved during tokenization, and searches for "μέλι" (with acute accent) and "μελι" (without) match different sets of rows.

This behavior conflicts with linguistic expectations in Greek, where diacritics often denote stress or pronunciation but do not alter lexical identity. For example:

  • “μέλι” (honey) and “μελι” (same word without accent) should match.
  • “προϊόν” (product with diaeresis) and “προιόν” (without) should match.

The default FTS5 configuration treats these as distinct tokens due to Unicode codepoint differences. This creates a usability gap for applications requiring diacritic-insensitive searches in Greek.
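
The difference is visible at the codepoint level, entirely outside SQLite; a quick check with plain Python (standard library only):

# 'μέλι' and 'μελι' differ in a single codepoint (U+03AD vs U+03B5), so a tokenizer
# that does not strip Greek diacritics indexes them as two unrelated terms.
for word in ('μέλι', 'μελι'):
    print(word, [hex(ord(c)) for c in word])
# μέλι ['0x3bc', '0x3ad', '0x3bb', '0x3b9']
# μελι ['0x3bc', '0x3b5', '0x3bb', '0x3b9']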

Tokenizer Constraints and Unicode Normalization Gaps

1. unicode61 Tokenizer’s Diacritic Removal Limitations

The unicode61 tokenizer’s remove_diacritics option operates on a predefined subset of Unicode characters, essentially Latin script. Greek diacritics are excluded from this logic. The tokenizer splits text into tokens after case-folding and diacritic removal, but the removal applies only to supported characters; Greek characters bypass diacritic stripping entirely, so their tokens retain accents and breathing marks.
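
The asymmetry is easy to reproduce from Python’s sqlite3 module (assuming the underlying SQLite build includes FTS5): the Latin diacritic is stripped, the Greek one is not.

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE VIRTUAL TABLE t USING fts5(w, tokenize="unicode61 remove_diacritics 2");
    INSERT INTO t VALUES ('café'), ('μέλι');
""")
# Latin script: the unaccented query finds the accented row.
print(conn.execute("SELECT w FROM t WHERE t MATCH 'cafe'").fetchall())   # [('café',)]
# Greek script: the unaccented query finds nothing.
print(conn.execute("SELECT w FROM t WHERE t MATCH 'μελι'").fetchall())   # []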

2. Absence of Built-In Greek-Specific Normalization

SQLite lacks built-in mechanisms for normalizing Greek text to a diacritic-insensitive form. Unicode normalization forms (NFD, NFC) decompose or recompose characters but do not remove diacritics. For example:

  • “ά” (U+03AC) decomposes to “α” (U+03B1) followed by a combining acute accent (U+0301) in NFD.
  • FTS5 performs no Unicode normalization of its own, so precomposed and decomposed spellings produce different tokens unless the text is normalized consistently before indexing and querying (see the Python check below).
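
A minimal Python check of this decomposition behaviour, using the standard unicodedata module:

import unicodedata

s = 'ά'                                        # precomposed U+03AC
nfd = unicodedata.normalize('NFD', s)          # 'α' (U+03B1) + combining acute (U+0301)
print([hex(ord(c)) for c in s])                # ['0x3ac']
print([hex(ord(c)) for c in nfd])              # ['0x3b1', '0x301']
# NFC merely recomposes the pair; neither normalization form drops the accent.
print(unicodedata.normalize('NFC', nfd) == s)  # True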

3. Tokenizer Customization Overhead

FTS5 allows custom tokenizers, but implementing one requires modifying SQLite’s source code or writing a loadable extension. This is non-trivial for users unfamiliar with SQLite’s internals. Prebuilt extensions like the Snowball stemmer might help but may not address Greek diacritics directly.

Custom Tokenizer Implementation and Query Workarounds

Step 1: Validate Current Tokenizer Behavior

Confirm the default tokenizer’s handling of Greek characters:

-- Create a table with `unicode61` and `remove_diacritics` enabled for illustration:
CREATE VIRTUAL TABLE temp.greek_test USING fts5(
    word, 
    tokenize="unicode61 remove_diacritics 2"
);
INSERT INTO greek_test VALUES ('μέλι'), ('μελι'), ('προϊόν'), ('προιόν');

-- Query for the accented form:
SELECT snippet(greek_test, 0, '<b>', '</b>', '…', 10) 
FROM greek_test 
WHERE greek_test MATCH 'μέλι'; 
-- Returns only the 'μέλι' row; 'μελι' is not matched despite `remove_diacritics 2`

This confirms that Greek diacritics are not stripped.

Step 2: Modify SQLite’s Diacritic Removal Logic

To extend diacritic removal to Greek, modify the fts5_unicode2RemoveDiacritic() function, found in ext/fts5/fts5_unicode2.c in the source tree or inside sqlite3.c if you build from the amalgamation. Locate the lookup tables that map accented codepoints to their base forms and add entries for Greek characters. Note that the real tables use a compact encoding, so the array below illustrates the mappings to add rather than serving as a drop-in patch:

Example Additions for Greek Accented Vowels:

static const unsigned short unicode[] = {
  /* … existing entries … */
  /* Greek lowercase vowels with diacritics, mapped to their unaccented base forms */
  0x03AC, 0x03B1,   // ά → α
  0x03AD, 0x03B5,   // έ → ε
  0x03AE, 0x03B7,   // ή → η
  0x03AF, 0x03B9,   // ί → ι
  0x0390, 0x03B9,   // ΐ → ι
  0x03B0, 0x03C5,   // ΰ → υ
  0x03CA, 0x03B9,   // ϊ → ι
  0x03CB, 0x03C5,   // ϋ → υ
  0x03CC, 0x03BF,   // ό → ο
  0x03CD, 0x03C5,   // ύ → υ
  0x03CE, 0x03C9,   // ώ → ω
};

Recompile SQLite with this modified code (for amalgamation builds, rebuild with FTS5 enabled via -DSQLITE_ENABLE_FTS5).

Step 3: Employ Custom Tokenizer with Extended Diacritic Removal

After recompiling, create an FTS5 table using the modified tokenizer:

CREATE VIRTUAL TABLE greek_word_fts USING fts5(
    word, 
    number, 
    tokenize="unicode61 remove_diacritics 2"
);
INSERT INTO greek_word_fts SELECT * FROM greek_word;

-- Test diacritic-insensitive search:
SELECT * FROM greek_word_fts WHERE greek_word_fts MATCH 'μέλι';
-- Returns both 'μέλι' (1) and 'μελι' (2)

Step 4: Preprocess Data with Unicode Normalization

If modifying SQLite is impractical, preprocess text to remove diacritics before insertion:

  • Use Python’s unicodedata library:
import unicodedata

def strip_diacritics(text):
    # Decompose to NFD, then drop every combining mark (accents, diaereses, breathing marks).
    return ''.join(
        c for c in unicodedata.normalize('NFD', text)
        if not unicodedata.combining(c)
    )

Apply this function to word values before inserting into FTS5. Because strip_diacritics() is a Python function, the SQL below assumes it has been registered as an SQL function (a sqlite3-based sketch follows at the end of this step); otherwise, apply it in application code before building the statements:

INSERT INTO greek_word_fts(word, number) 
VALUES (strip_diacritics('μέλι'), 1), 
       (strip_diacritics('μελι'), 2);

Queries must also apply strip_diacritics() to search terms:

SELECT * FROM greek_word_fts 
WHERE greek_word_fts MATCH strip_diacritics('μέλι');
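
A minimal sketch of wiring this together with Python’s sqlite3 module, so that strip_diacritics() is callable from SQL as in the statements above (the in-memory database and two-column layout here are purely illustrative):

import sqlite3
import unicodedata

def strip_diacritics(text):
    return ''.join(
        c for c in unicodedata.normalize('NFD', text)
        if not unicodedata.combining(c)
    )

conn = sqlite3.connect(':memory:')
# Expose the Python helper to SQL so INSERT and MATCH expressions can call it.
conn.create_function('strip_diacritics', 1, strip_diacritics)

conn.execute('CREATE VIRTUAL TABLE greek_word_fts USING fts5(word, number)')
conn.executemany(
    'INSERT INTO greek_word_fts(word, number) VALUES (strip_diacritics(?), ?)',
    [('μέλι', 1), ('μελι', 2)],
)
# Both rows come back, because index terms and query terms are stripped identically.
print(conn.execute(
    'SELECT word, number FROM greek_word_fts '
    'WHERE greek_word_fts MATCH strip_diacritics(?)',
    ('μέλι',),
).fetchall())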

Step 5: Leverage the Snowball Stemmer Extension

The fts5-snowball extension provides stemming for multiple languages. If your build of the extension supports Greek, load it and point the table’s tokenizer at it:

-- Load the extension (path varies by OS):
.load './fts5_snowball'

CREATE VIRTUAL TABLE greek_word_fts USING fts5(
    word, 
    number, 
    tokenize="snowball greek"
);

Verify whether stemming conflates diacritic variants (a Python check follows below); if it does not, combine this approach with the preprocessing from Step 4.
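
One way to run that check from Python, assuming the extension binary is available locally and actually ships a Greek stemmer (both depend on how fts5-snowball was built):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.enable_load_extension(True)
conn.load_extension('./fts5_snowball')   # path and stemmer availability are assumptions
conn.enable_load_extension(False)

conn.executescript("""
    CREATE VIRTUAL TABLE snow_test USING fts5(word, tokenize="snowball greek");
    INSERT INTO snow_test VALUES ('μέλι'), ('μελι');
""")
# If the stemmer conflates the accented and unaccented spellings, both rows are
# returned; if only one comes back, add the preprocessing from Step 4.
print(conn.execute(
    "SELECT word FROM snow_test WHERE snow_test MATCH 'μελι'"
).fetchall())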

Step 6: Debugging Custom Tokenizers

For developers troubleshooting custom tokenizers:

  1. Compile SQLite with Debug Symbols:
    Ensure CFLAGS="-g" is set during compilation.
  2. Set Breakpoints in fts5_unicode2RemoveDiacritic:
    Use GDB and break on the function by name:

    gdb ./sqlite3
    (gdb) break fts5_unicode2RemoveDiacritic
    (gdb) run < test_queries.sql
    
  3. Inspect Indexed Tokens:
    Create an fts5vocab table to list the terms actually stored in the index:

    CREATE VIRTUAL TABLE greek_word_fts_v USING fts5vocab('greek_word_fts', 'row');
    SELECT term FROM greek_word_fts_v;
    -- Check whether the indexed terms are the expected base forms.
    

By systematically addressing tokenizer limitations through code modification, preprocessing, or extensions, FTS5 can be made to perform diacritic-insensitive Greek search. Each approach balances development effort, maintainability, and linguistic accuracy.
