Fixing Short Non-ASCII String Search Failures in SQLite FTS5 Virtual Tables
Understanding FTS5 Tokenization Limitations with Short Non-ASCII Patterns
Issue Overview
The core challenge involves inconsistent search behavior when querying short (1-2 character) non-ASCII substrings in SQLite FTS5 virtual tables. While 3+ character searches for both ASCII and non-ASCII patterns work as expected via the `MATCH`, `LIKE`, and `GLOB` operators, shorter non-ASCII queries fail to return valid matches. This manifests in scenarios where:
- Cyrillic 2-character patterns like `%ир%` yield no results for filenames containing "жираф"
- Emoji/Unicode 2-character patterns like `%-🎵%` fail to locate entries with "🎵" symbols
- ASCII 2-character patterns using `LIKE`/`GLOB` function correctly (e.g., `%af%` finds "giraffe")
The anomaly occurs specifically within FTS5 virtual tables configured with the trigram tokenizer (`tokenize="trigram"`). Standard SQLite tables handle these short non-ASCII patterns correctly with `LIKE`/`GLOB`, confirming the issue is tied to FTS5's implementation. The problem creates critical usability gaps in applications requiring flexible substring matching for multilingual content.
Root Causes of Tokenization and Pattern Matching Failures
Three interrelated factors explain the observed behavior:
1. Trigram Tokenizer Minimum Token Length Requirements
The trigram tokenizer splits text into contiguous 3-character sequences. For "жираф.png", this produces tokens like "жир", "ира", "раф", and "аф.". A 2-character search like "ир" cannot match any trigram token. This explains why `MATCH 'ир'` returns nothing. However, `LIKE '%ир%'` should theoretically bypass tokenization and scan the raw text, yet it fails in FTS5 tables.
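The decomposition above can be sketched in a few lines of Python. This mirrors what the trigram tokenizer indexes, splitting per Unicode codepoint rather than per byte (the real tokenizer also lowercases its input by default):

```python
def trigrams(text: str) -> list[str]:
    """Return the contiguous 3-codepoint sequences of `text`,
    lowercased, as the FTS5 trigram tokenizer would emit them."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

tokens = trigrams("жираф.png")
# -> ["жир", "ира", "раф", "аф.", "ф.p", ".pn", "png"]
# A 2-character pattern such as "ир" equals none of these tokens,
# which is why a trigram MATCH query for it finds nothing.
```

This also shows why 3+ character queries succeed: "ира" is itself a token, so the index can answer it directly.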
2. FTS5 Storage Engine Optimizations Interfering with Raw String Searches
FTS5 virtual tables store content in optimized structures (e.g., inverted indices) designed for full-text search rather than raw string storage. When executing `LIKE`/`GLOB` queries, SQLite may attempt to leverage FTS5's tokenized data instead of performing full scans. For non-ASCII characters, encoding conversion issues between tokenizer output and raw column values can prevent matches. The trigram tokenizer's treatment of multi-byte UTF-8 characters as atomic units exacerbates this.
3. Pre-3.41.0 SQLite Versions Containing FTS5 Edge Case Bugs
The discussion references a specific fix (check-in 00714b39) that addresses Unicode handling in FTS5. Prior versions had incomplete logic for decomposing certain Unicode characters during tokenization and query processing. This caused inconsistencies when comparing short patterns containing multi-byte characters against FTS5-indexed content.
Comprehensive Solutions for Reliable Short Substring Matching
Resolving these issues requires addressing both tokenization strategy and SQLite version compatibility. Below are actionable fixes:
1. Upgrade to SQLite 3.41.0 or Newer
The referenced check-in first appeared in version 3.41.0. Confirm your SQLite version using:

```sql
SELECT sqlite_version();
```
If using an older version, upgrade the library. For embedded environments:
- Download the amalgamation source from sqlite.org/download
- Replace `sqlite3.c` and `sqlite3.h` with the 3.41.0+ versions
- Recompile your application

Post-upgrade, retest the problematic queries. The fix ensures proper handling of multi-byte characters during FTS5 tokenization and `LIKE`/`GLOB` pattern resolution.
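From application code, the linked library version can be checked the same way. A minimal sketch using Python's built-in `sqlite3` module (the 3.41.0 threshold is the one discussed above):

```python
import sqlite3

# Version of the SQLite library linked into this process
# (distinct from the sqlite3 module's own version).
print(sqlite3.sqlite_version)

version = sqlite3.sqlite_version_info  # tuple of ints, e.g. (3, 45, 1)
if version < (3, 41, 0):
    print("FTS5 fix for short non-ASCII LIKE/GLOB may be absent; upgrade advised")
```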
2. Implement Dual Storage with Shadow Tables
When upgrading isn’t feasible, maintain a regular table alongside the FTS5 virtual table:
```sql
CREATE TABLE filenames_data(id INTEGER PRIMARY KEY, filename TEXT);
CREATE VIRTUAL TABLE filenames_fts USING fts5(filename, tokenize="trigram");

-- Use triggers to sync data
-- (similar triggers are needed for UPDATE and DELETE to keep the
-- two tables consistent)
CREATE TRIGGER filenames_insert AFTER INSERT ON filenames_data
BEGIN
    INSERT INTO filenames_fts(rowid, filename) VALUES (new.id, new.filename);
END;
```
Execute short-pattern searches against `filenames_data` using `LIKE`/`GLOB`, and full-text searches against `filenames_fts` with `MATCH`. This bypasses FTS5's limitations while preserving full-text search functionality.
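A minimal end-to-end sketch of this dual-storage pattern, using Python's `sqlite3` module (assumes an SQLite build with FTS5 and its trigram tokenizer, available since 3.34):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE filenames_data(id INTEGER PRIMARY KEY, filename TEXT);
    CREATE VIRTUAL TABLE filenames_fts USING fts5(filename, tokenize="trigram");
    CREATE TRIGGER filenames_insert AFTER INSERT ON filenames_data
    BEGIN
        INSERT INTO filenames_fts(rowid, filename) VALUES (new.id, new.filename);
    END;
""")
db.execute("INSERT INTO filenames_data(filename) VALUES ('жираф.png')")

# Short non-ASCII pattern: query the regular table with LIKE.
short = db.execute(
    "SELECT filename FROM filenames_data WHERE filename LIKE '%ир%'"
).fetchall()

# 3+ character pattern: query the FTS5 table with MATCH.
full = db.execute(
    "SELECT filename FROM filenames_fts WHERE filename MATCH 'ира'"
).fetchall()

print(short, full)  # both queries find ('жираф.png',)
```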
3. Custom Tokenizer for 1-2 Character N-Grams
Modify tokenization logic to generate 1- and 2-character tokens in addition to trigrams. SQLite's FTS5 API allows custom tokenizers via C extensions. Example pseudocode:

```
// In the tokenizer's xTokenize callback:
for each codepoint position i in input_text {
    emit the 1-codepoint token starting at i;
    if one more codepoint exists, emit the 2-codepoint token starting at i;
    if two more codepoints exist, emit the 3-codepoint token starting at i;
}
```
This produces overlapping tokens enabling short matches. However, custom tokenizers increase index size and require native programming.
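The emission logic above can be sketched in Python to show which token set such a tokenizer would index (this illustrates the tokens only; a real FTS5 tokenizer must be registered in C through the `fts5_tokenizer` interface):

```python
def ngrams_1to3(text: str) -> set[str]:
    """All contiguous 1-, 2- and 3-codepoint substrings of `text`:
    the overlapping tokens the custom tokenizer would emit."""
    text = text.lower()
    tokens = set()
    for i in range(len(text)):
        for n in (1, 2, 3):
            if i + n <= len(text):
                tokens.add(text[i:i + n])
    return tokens

tokens = ngrams_1to3("жираф")
# Short queries now correspond to real tokens:
# "ир" in tokens -> True, "ж" in tokens -> True
```

Note the size tradeoff this implies: a string of n codepoints yields roughly 3n tokens instead of n - 2 trigrams, which is what makes the index larger.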
4. Collation-Aware LIKE Optimization
Force SQLite to use byte-wise comparison for `LIKE` by specifying the `BINARY` collation:

```sql
SELECT * FROM filenames
WHERE filename LIKE '%ир%' COLLATE BINARY;
```
This bypasses locale-specific collation rules that might ignore certain Unicode characters.
5. Escaping Special Characters in FTS5 Queries
When using `MATCH`, ensure proper escaping of symbols such as emojis by quoting them as phrase strings:

```sql
SELECT * FROM filenames
WHERE filename MATCH '"ир" OR "🎵"';
```

Note that with the trigram tokenizer, quoted phrases shorter than three characters still cannot match any indexed token, so this helps mainly with query-syntax errors rather than short-pattern recall.
For longer patterns, combine adjacent trigrams with the `NEAR` operator to approximate substring matching (FTS5 uses the function-style `NEAR` syntax):

```sql
SELECT * FROM filenames
WHERE filename MATCH 'NEAR("жир" "раф", 1)';
```
6. Leveraging Unicode Normalization
Normalize text to NFC form before insertion:

```sql
INSERT INTO filenames (filename)
VALUES (normalize('жираф', 'NFC'));
```
A `normalize()` SQL function is not built into core SQLite; use one provided by an extension (such as an ICU-based build) if available. Consistent normalization ensures stored tokens match query patterns.
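Normalization can also be applied in the application layer before the `INSERT`, which avoids any dependency on an SQL-level `normalize()` function. A sketch using Python's standard `unicodedata` module (the filename is a made-up example):

```python
import sqlite3
import unicodedata

# "й" written in decomposed form: base letter "и" + combining breve.
decomposed = "и\u0306"   # renders like "й" but is two codepoints
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "й")                 # True

# Normalize consistently at insert time so stored text, index tokens,
# and query patterns all use the same codepoint sequences.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE filenames(filename TEXT)")
db.execute("INSERT INTO filenames VALUES (?)",
           (unicodedata.normalize("NFC", "файл-й.png"),))
```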
7. Hybrid Search Strategies
Combine FTS5 `MATCH` with `LIKE` in one result set. Note that SQLite generally rejects `MATCH` inside an `OR` alongside other constraints on the same FTS5 table ("unable to use function MATCH in the requested context"), so express the combination as a `UNION`:

```sql
SELECT * FROM filenames WHERE filename MATCH 'aff'
UNION
SELECT * FROM filenames WHERE filename LIKE '%ир%';
```

This leverages FTS5 for efficient 3+ character searches and `LIKE` for short patterns.
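Because `MATCH` typically cannot appear inside an `OR` with other constraints on the same FTS5 table, the hybrid query is usually written as a `UNION`; a sketch in Python (assumes an SQLite build with FTS5 and the trigram tokenizer):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    'CREATE VIRTUAL TABLE filenames USING fts5(filename, tokenize="trigram")'
)
db.executemany("INSERT INTO filenames(filename) VALUES (?)",
               [("giraffe.png",), ("жираф.png",)])

# UNION combines the indexed MATCH branch with the LIKE branch;
# whether the short non-ASCII LIKE branch matches depends on the
# SQLite version, per the discussion above.
rows = db.execute("""
    SELECT filename FROM filenames WHERE filename MATCH 'aff'
    UNION
    SELECT filename FROM filenames WHERE filename LIKE '%ир%'
""").fetchall()
print(sorted(rows))
```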
By systematically applying these solutions – prioritizing SQLite upgrades and architectural workarounds – developers can achieve robust short substring matching across all Unicode characters in FTS5 virtual tables.