FTS5 Ukrainian Text Matching Issues: Cyrillic vs. Latin Character Conflicts
Issue Overview: FTS5 Fails to Match Ukrainian Words with Mixed Cyrillic-Latin Characters
The core challenge arises when using SQLite’s FTS5 virtual table to perform full-text searches on Ukrainian text containing specific characters, such as the Cyrillic letter і (Unicode U+0456, "CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I"). Users report that prefix queries (e.g., усмi*) fail to return expected results for tokens like "Усмiшки", while underscore-based wildcard patterns (e.g., усм_) do match. This discrepancy stems from a mismatch between the character encoding of the indexed data and the search input: the indexed text contains the Cyrillic і, while the search query inadvertently uses the Latin i (Unicode U+0069, "LATIN SMALL LETTER I").
FTS5 tokenizes text based on configurable rules, and its default behavior treats Cyrillic and Latin characters as distinct entities. When a search term includes a Latin i (e.g., усмi*), FTS5 looks for tokens beginning with усм plus the Latin i, and no such tokens exist in the indexed Ukrainian text. The underscore pattern усм_, by contrast, is not FTS5 syntax at all: _ is the single-character wildcard of SQLite’s LIKE operator, which compares characters one by one and therefore matches "Усмiшки" regardless of whether the fourth character is Cyrillic or Latin. This contrast highlights a critical nuance: FTS5’s prefix operator (*) depends on exact character matching within tokens, while LIKE’s _ sidesteps the script mismatch entirely.
The root cause is not a bug in SQLite but a collision between the character sets used during data ingestion and querying. Ukrainian text often includes the Cyrillic і, which is visually similar to the Latin i but occupies a different Unicode code point. If the application or user input mechanism (e.g., a keyboard layout) inserts a Latin i into search terms, FTS5 will fail to match the Cyrillic і due to their distinct Unicode representations. The issue is exacerbated by the default tokenizer’s inability to normalize these characters automatically: its remove_diacritics option only strips diacritical marks from Latin-script characters (e.g., é → e) and never maps letters across scripts.
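The distinction is easy to verify in any Unicode-aware language; the short Python check below is illustrative and not part of the original report:

```python
import unicodedata

cyr_i = "\u0456"  # Cyrillic і (CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I)
lat_i = "i"       # Latin i (U+0069)

# The two letters render almost identically but are different code points,
# so plain string comparison treats them as unequal.
print(cyr_i == lat_i)                     # False
print(hex(ord(cyr_i)), hex(ord(lat_i)))   # 0x456 0x69
print(unicodedata.name(cyr_i))
print(unicodedata.name(lat_i))
```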
Possible Causes: Encoding Mismatches, Tokenizer Limitations, and Input Handling Flaws
1. Unicode Encoding Discrepancies Between Data and Queries
The Ukrainian language uses the Cyrillic alphabet, which includes characters that resemble Latin letters but have distinct Unicode code points. For example:
- Cyrillic і: Unicode U+0456 (UTF-8: 0xD1 0x96)
- Latin i: Unicode U+0069 (UTF-8: 0x69)
When text is inserted into an FTS5 table, the tokenizer processes raw byte sequences. If the indexed data contains the Cyrillic і but the search query uses the Latin i, FTS5 treats them as different tokens. This is not a bug but a feature of Unicode-aware tokenization: characters from different Unicode blocks are not conflated unless explicitly normalized.
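The mismatch can be reproduced end to end from Python’s sqlite3 module, assuming the bundled SQLite library was compiled with FTS5 (true of most modern builds); the table and sample word here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE t USING fts5(title)")
# Store the word with the Cyrillic і (U+0456).
conn.execute("INSERT INTO t(title) VALUES (?)", ("Усм\u0456шки",))

# Prefix query typed with the Latin i: finds nothing.
latin = conn.execute("SELECT title FROM t WHERE t MATCH ?", ("усмi*",)).fetchall()
# The same query typed with the Cyrillic і: finds the row.
cyrillic = conn.execute(
    "SELECT title FROM t WHERE t MATCH ?", ("усм\u0456*",)
).fetchall()

print(len(latin), len(cyrillic))  # 0 1
```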
2. Inadequate Tokenizer Configuration for Cyrillic Text
FTS5’s default tokenizer (unicode61) splits text into tokens based on Unicode 6.1 whitespace and punctuation rules. The remove_diacritics option removes diacritical marks from Latin characters (e.g., ç → c), but it does not handle equivalences between Cyrillic and Latin letters. For example:
CREATE VIRTUAL TABLE books_fts USING fts5(
  title,
  content='',
  tokenize='unicode61 remove_diacritics 2'
);
This configuration removes diacritics from Latin characters but leaves the Cyrillic і and the Latin i as separate entities. Consequently, усмi* (Latin i) will not match Усмiшки (Cyrillic і), even if the rest of the characters align.
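Standard Unicode normalization does not close the gap either: U+0456 has no decomposition mapping, so even the aggressive NFKC/NFKD forms leave the two letters distinct. A quick Python check (illustrative only):

```python
import unicodedata

cyr_i = "\u0456"  # Cyrillic і
# None of the four normalization forms turns the Cyrillic і into a Latin i.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, unicodedata.normalize(form, cyr_i) == "i")  # all False
```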
3. Input Source Inconsistencies (Keyboard Layouts, APIs, etc.)
User input mechanisms often contribute to encoding mismatches. For instance:
- A user typing on a Latin keyboard layout may inadvertently enter i instead of switching to a Ukrainian layout for і.
- Applications that programmatically generate search terms might fail to enforce Unicode normalization, leading to mixed encoding in queries.
- Database collations or client-side encoding conversions might silently alter characters, though SQLite generally stores and retrieves UTF-8 data verbatim.
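A cheap defensive measure is to flag Latin letters embedded in otherwise Cyrillic words before the query ever reaches the database. The helper below is a sketch; its name, regex, and narrow i-only focus are assumptions, not part of the original report:

```python
import re

# A Latin "i" directly adjacent to a Cyrillic letter (U+0400-U+04FF) is
# almost certainly a keyboard-layout slip in Ukrainian text.
MIXED_I = re.compile(r"(?<=[\u0400-\u04FF])i|i(?=[\u0400-\u04FF])")

def has_mixed_latin_i(text: str) -> bool:
    return MIXED_I.search(text) is not None

print(has_mixed_latin_i("усм\u0069шки"))  # True: Latin i inside a Cyrillic word
print(has_mixed_latin_i("list"))          # False: purely Latin word
```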
Troubleshooting Steps, Solutions & Fixes: Resolving Encoding Conflicts in FTS5 Queries
Step 1: Verify the Actual Encoding of Stored Data and Queries
Before modifying tokenizers or application logic, confirm the exact byte sequences stored in the FTS5 table and those used in search queries. Use SQLite’s HEX() function to inspect the raw bytes of both the indexed data and the search parameter:
-- Check stored data
SELECT title, HEX(title) FROM books_fts WHERE rowid = 1;
-- Example output for "Усмiшки" stored with the Cyrillic і:
-- Усмiшки | D0A3D181D0BCD196D188D0BAD0B8

-- Inspect the search parameter
SELECT HEX('усмi*');
-- If the 'i' is Latin, the output contains the single byte 69:
-- D183D181D0BC692A
In the example above, the Cyrillic і in "Усмiшки" corresponds to the byte pair D196 (UTF-8 for U+0456), while a Latin i in the search term appears as the single byte 69. If the HEX outputs show differing byte sequences at the position of the i/і character, the encoding mismatch is confirmed.
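The same verification can be scripted; this snippet runs the HEX() comparison through Python’s sqlite3 module (plain SQLite, no FTS5 required):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def hex_of(s):
    # Ask SQLite for the hex dump of the UTF-8 bytes it receives.
    return conn.execute("SELECT HEX(?)", (s,)).fetchone()[0]

with_cyrillic = hex_of("усм\u0456*")  # Cyrillic і
with_latin = hex_of("усм\u0069*")     # Latin i

print(with_cyrillic)  # D183D181D0BCD1962A
print(with_latin)     # D183D181D0BC692A
```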
Step 2: Normalize Inputs to Use Consistent Unicode Characters
Ensure that both the indexed data and search queries use the same Unicode representation for visually similar characters. This can be achieved through:
A. Client-Side Unicode Normalization
Implement input sanitization in the application layer to replace the Latin i with the Cyrillic і in Ukrainian text contexts. For example, in Python:
def normalize_ukrainian_i(text):
    # Replace every Latin "i" (U+0069) with the Cyrillic "і" (U+0456).
    return text.replace('i', '\u0456')
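A blanket replace also rewrites legitimate Latin words (product names, URLs, English titles), so a more cautious variant substitutes only an i that touches a Cyrillic letter. The function and regex below are a sketch, not part of the original report:

```python
import re

# Replace a Latin "i" only when it is adjacent to a Cyrillic letter,
# leaving purely Latin words untouched.
_MIXED_I = re.compile(r"(?<=[\u0400-\u04FF])i|i(?=[\u0400-\u04FF])")

def normalize_ukrainian_i_safe(text: str) -> str:
    return _MIXED_I.sub("\u0456", text)

print(normalize_ukrainian_i_safe("усм\u0069шки") == "усм\u0456шки")  # True
print(normalize_ukrainian_i_safe("iPhone"))  # unchanged
```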
B. Database-Level Normalization with User-Defined Functions (UDFs)
SQLite allows registering UDFs to preprocess text before insertion or searching. Create a function that replaces the Latin i with the Cyrillic і:
import sqlite3

def normalize_text(text):
    # Map every Latin "i" (U+0069) to the Cyrillic "і" (U+0456).
    return text.replace('i', '\u0456')

conn = sqlite3.connect(':memory:')
conn.create_function('NORMALIZE_UKR', 1, normalize_text)

# Usage in queries: the UDF normalizes the search term before MATCH sees it
cursor = conn.execute(
    "SELECT rowid FROM books_fts WHERE title MATCH NORMALIZE_UKR(?)",
    ('усмi*',)
)
Step 3: Custom Tokenizers for Cyrillic-Latin Equivalence
If normalizing inputs is impractical, extend FTS5’s tokenization logic to treat the Cyrillic і and the Latin i as equivalent. This requires creating a custom tokenizer using SQLite’s C API. Below is a conceptual outline:
- Wrap the built-in unicode61 tokenizer and, in your own xTokenize callback, map i (U+0069) to і (U+0456) before delegating to it.
- Compile the custom tokenizer as a SQLite extension.
- Configure the FTS5 table to use the custom tokenizer.
Example C sketch for a custom tokenizer (the replace_char() helper is pseudo-code; registration uses the fts5_api interface, whose pointer is obtained from the database handle via "SELECT fts5(?1)" and sqlite3_bind_pointer()):
#include "sqlite3.h"
#include "fts5.h"

/*
** xTokenize: rewrite every Latin 'i' (byte 0x69) to the Cyrillic 'і'
** (UTF-8 bytes 0xD1 0x96), then delegate to a wrapped unicode61
** tokenizer. Replacing the bare byte 0x69 is safe in valid UTF-8,
** because 0x69 never occurs inside a multi-byte sequence.
*/
static int xTokenize(
  Fts5Tokenizer *pTokenizer,
  void *pCtx,
  int flags,
  const char *pText,
  int nText,
  int (*xToken)(void *pCtx, int tflags, const char *pToken,
                int nToken, int iStart, int iEnd)
){
  int nMod = 0;
  /* replace_char() is pseudo-code; it must allocate a larger buffer,
  ** since each replacement grows the text by one byte. */
  char *zMod = replace_char(pText, nText, 0x69, "\xD1\x96", &nMod);
  /* Delegate to the unicode61 instance looked up via xFindTokenizer()
  ** in xCreate(); error handling and cleanup omitted for brevity. */
  return pUnicode61.xTokenize(pInnerTok, pCtx, flags, zMod, nMod, xToken);
}

/* Register the wrapper with FTS5 */
fts5_tokenizer tok = { xCreate, xDelete, xTokenize };
pFts5Api->xCreateTokenizer(pFts5Api, "cyrillic_latin", pUserData, &tok, 0);
After compiling and loading the extension, create the FTS5 table with:
CREATE VIRTUAL TABLE books_fts USING fts5(
  title,
  content='',
  tokenize='cyrillic_latin'
);
Step 4: Educate Users on Keyboard Layouts and Input Methods
If the application relies on manual text entry, guide users to switch to a Ukrainian keyboard layout when typing search terms. Provide visual cues or input validation to detect a Latin i in Ukrainian contexts and suggest corrections. For web applications, consider integrating virtual keyboards or input method editors (IMEs) that enforce Cyrillic input.
Step 5: Use Triggers to Normalize Data at Insertion
To ensure consistency in the indexed data, create a trigger that normalizes text before it enters the FTS5 table:
-- Create a shadow table for raw data
CREATE TABLE books (
  id INTEGER PRIMARY KEY,
  title TEXT
);
-- Create the FTS5 table
CREATE VIRTUAL TABLE books_fts USING fts5(title, content='');
-- Trigger to normalize on insertion: every Latin 'i' becomes the
-- Cyrillic 'і' (char(0x0456); hex literals need SQLite 3.8.6+)
CREATE TRIGGER books_ai AFTER INSERT ON books BEGIN
  INSERT INTO books_fts (rowid, title)
  VALUES (new.id, REPLACE(new.title, 'i', char(0x0456)));
END;
This approach ensures that all instances of the Latin i in inserted titles are converted to the Cyrillic і before indexing. Note that search terms must be normalized the same way (for example, with the NORMALIZE_UKR UDF from Step 2); a query still typed with the Latin i will not match the normalized index on its own.
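The REPLACE() expression at the heart of the trigger can be checked in isolation from Python (plain SQLite, no FTS5 or triggers needed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# char(1110) is the Cyrillic і (U+0456 == 1110 decimal); the input below
# deliberately contains a Latin i (U+0069).
normalized = conn.execute(
    "SELECT REPLACE(?, 'i', char(1110))", ("Усм\u0069шки",)
).fetchone()[0]

print(normalized == "Усм\u0456шки")  # True: Latin i replaced by Cyrillic і
```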
Step 6: Expand Prefix Queries to Cover Both Character Variants
If modifying the tokenizer is not feasible, use auxiliary scripts to generate expanded queries. FTS5’s MATCH syntax has no regex-style character classes, so the variants must be enumerated explicitly: усмi* becomes (усмi* OR усмі*), covering both the Latin and the Cyrillic i:
def expand_query(term):
    # FTS5 MATCH has no character classes, so emit one variant per script:
    # the term as typed (Latin i) and the term with every Latin i replaced
    # by the Cyrillic і, joined with OR.
    variants = {term, term.replace('i', '\u0456')}
    return '(' + ' OR '.join(sorted(variants)) + ')'
Apply this expansion before passing the query to SQLite:
SELECT rowid FROM books_fts WHERE title MATCH '(усмi* OR усмі*)';
The OR alternation lets the prefix operator capture tokens written with either character at the critical position, without any changes to the tokenizer or the stored data.
By systematically addressing encoding discrepancies, refining tokenization rules, and normalizing inputs, developers can resolve FTS5’s failure to match Ukrainian text with mixed Cyrillic-Latin characters. The solutions range from immediate client-side fixes to long-term architectural changes, ensuring robust full-text search capabilities for multilingual applications.