Fixing Short Non-ASCII String Search Failures in SQLite FTS5 Virtual Tables

Understanding FTS5 Tokenization Limitations with Short Non-ASCII Patterns

Issue Overview

The core challenge involves inconsistent search behavior when querying short (1-2 character) non-ASCII substrings in SQLite FTS5 virtual tables. While 3+ character searches for both ASCII and non-ASCII patterns work as expected via MATCH, LIKE, and GLOB operators, shorter non-ASCII queries fail to return valid matches. This manifests in scenarios where:

  • Cyrillic 2-character patterns like %ир% yield no results for filenames containing "жираф"
  • Emoji/Unicode 2-character patterns like %-🎵% fail to locate entries with "🎵" symbols
  • ASCII 2-character patterns using LIKE/GLOB function correctly (e.g., %af% finds "giraffe")

The anomaly occurs specifically within FTS5 virtual tables configured with the trigram tokenizer (tokenize="trigram"). Standard SQLite tables handle these short non-ASCII patterns correctly with LIKE/GLOB, confirming the issue is tied to FTS5’s implementation. The problem creates critical usability gaps in applications requiring flexible substring matching for multilingual content.

Root Causes of Tokenization and Pattern Matching Failures

Three interrelated factors explain the observed behavior:

1. Trigram Tokenizer Minimum Token Length Requirements
The trigram tokenizer splits text into contiguous 3-character sequences. For "жираф.png", this produces tokens like "жир", "ира", "раф", and "аф.". A 2-character search like "ир" cannot match any trigram token. This explains why MATCH 'ир' returns nothing. However, LIKE '%ир%' should theoretically bypass tokenization and scan raw text – yet it fails in FTS5 tables.
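The token stream can be reproduced with a few lines of Python. This is a simplified model of the trigram tokenizer (the real FTS5 implementation also case-folds and can optionally remove diacritics), but it shows why no 2-character query can ever equal an indexed token:

```python
def trigrams(text: str) -> list[str]:
    """Simplified model of FTS5's trigram tokenizer: case-fold,
    then emit every contiguous 3-character window."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

print(trigrams("жираф.png"))
# ['жир', 'ира', 'раф', 'аф.', 'ф.p', '.pn', 'png']
# every token is exactly 3 characters long, so the
# 2-character query "ир" cannot equal any of them
```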

2. FTS5 Storage Engine Optimizations Interfering with Raw String Searches
FTS5 virtual tables store content in structures optimized for full-text search (inverted indices) rather than for raw string scans. When a LIKE or GLOB constraint is applied to a trigram-tokenized table, the query planner tries to satisfy it from the trigram index instead of scanning the stored column values. A pattern shorter than three characters yields no complete trigram to look up, and in affected versions the fallback comparison mishandled multi-byte UTF-8 characters, so valid non-ASCII matches were silently dropped. The trigram tokenizer's counting of multi-byte UTF-8 characters as single units exacerbates this.

3. Pre-3.41.0 SQLite Versions Containing FTS5 Edge Case Bugs
The discussion references a specific fix (check-in 00714b39) that addresses Unicode handling in FTS5. Prior versions had incomplete logic for decomposing certain Unicode characters during tokenization and query processing. This caused inconsistencies when comparing short patterns containing multi-byte characters against FTS5-indexed content.

Comprehensive Solutions for Reliable Short Substring Matching

Resolving these issues requires addressing both tokenization strategy and SQLite version compatibility. Below are actionable fixes:

1. Upgrade to SQLite 3.41.0 or Newer
The referenced check-in first appeared in version 3.41.0. Confirm your SQLite version using:

SELECT sqlite_version();  

If using an older version, upgrade the library. For embedded environments:

  • Download amalgamation code from sqlite.org/download
  • Replace sqlite3.c and sqlite3.h with 3.41.0+ versions
  • Recompile your application

Post-upgrade, retest problematic queries. The fix ensures proper handling of multi-byte characters during FTS5 tokenization and LIKE/GLOB pattern resolution.
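From an application, the linked library version can be checked programmatically; a sketch using Python's bundled sqlite3 module:

```python
import sqlite3

# sqlite3.sqlite_version_info reports the version of the SQLite
# library actually linked in (not the Python module's version),
# which is what matters for the fix
linked = sqlite3.sqlite_version_info
has_fix = linked >= (3, 41, 0)
print(f"SQLite {sqlite3.sqlite_version}, 3.41.0 or newer: {has_fix}")
```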

2. Implement Dual Storage with Shadow Tables
When upgrading isn’t feasible, maintain a regular table alongside the FTS5 virtual table:

CREATE TABLE filenames_data(id INTEGER PRIMARY KEY, filename TEXT);  
CREATE VIRTUAL TABLE filenames_fts USING fts5(filename, tokenize="trigram");  

-- Use triggers to sync data (add matching UPDATE and DELETE triggers
-- as well, so the index never holds stale entries)
CREATE TRIGGER filenames_insert AFTER INSERT ON filenames_data  
BEGIN  
  INSERT INTO filenames_fts(rowid, filename) VALUES (new.id, new.filename);  
END;  

Execute short-pattern searches against filenames_data using LIKE/GLOB, and full-text searches against filenames_fts with MATCH. This bypasses FTS5’s limitations while preserving search functionality.
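The whole arrangement fits in a short runnable sketch using Python's sqlite3 module. One assumption for portability: the default unicode61 tokenizer is used here instead of the article's trigram tokenizer, which requires SQLite 3.34+:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE filenames_data(id INTEGER PRIMARY KEY, filename TEXT);
-- the article uses tokenize="trigram" (SQLite 3.34+); the default
-- unicode61 tokenizer is used here so the sketch runs everywhere
CREATE VIRTUAL TABLE filenames_fts USING fts5(filename);
CREATE TRIGGER filenames_insert AFTER INSERT ON filenames_data BEGIN
  INSERT INTO filenames_fts(rowid, filename) VALUES (new.id, new.filename);
END;
CREATE TRIGGER filenames_delete AFTER DELETE ON filenames_data BEGIN
  DELETE FROM filenames_fts WHERE rowid = old.id;
END;
""")
conn.execute("INSERT INTO filenames_data(filename) VALUES ('жираф.png')")

# short non-ASCII pattern: served by the plain table via LIKE
short = conn.execute(
    "SELECT filename FROM filenames_data WHERE filename LIKE '%ир%'"
).fetchall()
# full-text query: served by the FTS5 table kept in sync by the trigger
full = conn.execute(
    "SELECT filename FROM filenames_fts WHERE filenames_fts MATCH 'жираф'"
).fetchall()
print(short, full)  # [('жираф.png',)] [('жираф.png',)]
```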

3. Custom Tokenizer for 1-2 Character N-Grams
Modify tokenization logic to generate 1/2-character tokens for non-ASCII text. SQLite’s FTS5 API allows custom tokenizers via C extensions. Example pseudocode:

// In tokenizer callback function:
for each codepoint in input_text {
  emit 1-char token;  
  if next codepoint exists, emit 2-char token;  
  if next two codepoints exist, emit 3-char token;  
}  

This produces overlapping tokens enabling short matches. However, custom tokenizers increase index size and require native programming.
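The pseudocode above can be modeled in a few lines of Python to show the token stream such a tokenizer would emit (a model only; a real FTS5 tokenizer must be registered through the C API's fts5_tokenizer interface):

```python
def short_ngrams(text: str) -> list[str]:
    """Emit overlapping 1-, 2-, and 3-character tokens,
    mirroring the pseudocode above."""
    text = text.lower()
    return [text[i:i + n]
            for i in range(len(text))
            for n in (1, 2, 3)
            if i + n <= len(text)]

print(short_ngrams("жир"))  # ['ж', 'жи', 'жир', 'и', 'ир', 'р']
```

Because every 1- and 2-character substring now appears as a token, a query like "ир" has an index entry to hit, at the cost of roughly tripling the number of tokens per document.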

4. Byte-Wise LIKE Matching and Case Sensitivity
SQLite's built-in LIKE does not consult collation sequences at all: it folds case for ASCII letters only and compares every other byte exactly, so locale-specific rules come into play only when an extension such as ICU replaces the like() function. To rule out case-folding surprises entirely, enable fully case-sensitive matching:

PRAGMA case_sensitive_like = ON;
SELECT * FROM filenames
WHERE filename LIKE '%ир%';

This clarifies LIKE semantics on ordinary tables, but it does not by itself repair the FTS5 trigram pattern-matching defect.
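The ASCII-only case folding of the built-in LIKE can be observed directly; a sketch on a regular table using Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filenames(filename TEXT)")
conn.executemany("INSERT INTO filenames VALUES (?)",
                 [("жираф.png",), ("GIRAFFE.PNG",)])

# ASCII letters are case-folded: a lower-case pattern
# finds the upper-case row
ascii_hits = conn.execute(
    "SELECT filename FROM filenames WHERE filename LIKE '%giraffe%'"
).fetchall()
# non-ASCII letters are compared byte-for-byte: case matters
lower_hits = conn.execute(
    "SELECT filename FROM filenames WHERE filename LIKE '%ир%'"
).fetchall()
upper_hits = conn.execute(
    "SELECT filename FROM filenames WHERE filename LIKE '%ИР%'"
).fetchall()
print(ascii_hits, lower_hits, upper_hits)
# [('GIRAFFE.PNG',)] [('жираф.png',)] []
```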

5. Escaping Special Characters in FTS5 Queries
When using MATCH, wrap terms containing punctuation, emoji, or other symbols in double quotes so the FTS5 query parser treats them as plain strings rather than query syntax:

SELECT * FROM filenames
WHERE filename MATCH '"раф" OR "png"';

Quoting does not change tokenization, however: with the trigram tokenizer, a quoted term shorter than three characters (such as "ир" or a lone "🎵") still produces no trigram to match. Note also that FTS5's proximity operator takes the form NEAR("a" "b", N); the FTS3/4-style a NEAR/0 b syntax is invalid in FTS5, and NEAR cannot approximate sub-trigram substring matching either.

6. Leveraging Unicode Normalization
The same visible character can be stored composed or decomposed (for example "й" as U+0439, or as U+0438 followed by combining U+0306), and mixed forms defeat byte-wise comparison. SQLite provides no built-in normalize() SQL function, so normalize text in application code before insertion, or register an application-defined function for the purpose:

-- assumes the application has registered a user-defined nfc() function
INSERT INTO filenames (filename)
VALUES (nfc('жираф'));

Normalizing both stored text and query patterns to the same form (NFC is the usual choice) ensures tokens and patterns compare equal.
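One way to make such normalization available inside SQL is to register an application-defined function; a minimal Python sketch, where the nfc() function name is our own choice rather than anything built into SQLite:

```python
import sqlite3
import unicodedata

conn = sqlite3.connect(":memory:")
# register a user-defined nfc() SQL function (not a SQLite built-in)
conn.create_function(
    "nfc", 1,
    lambda s: unicodedata.normalize("NFC", s) if s is not None else None,
)
conn.execute("CREATE TABLE filenames(filename TEXT)")

# "й" arriving decomposed: U+0438 CYRILLIC I + U+0306 COMBINING BREVE
conn.execute("INSERT INTO filenames VALUES (nfc(?))", ("\u0438\u0306огурт",))

# after normalization, the composed form U+0439 matches byte-for-byte
row = conn.execute(
    "SELECT filename FROM filenames WHERE filename LIKE ?", ("%\u0439%",)
).fetchone()
print(row)  # ('йогурт',)
```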

7. Hybrid Search Strategies
Route 3+ character searches through FTS5 and shorter patterns through LIKE. Because SQLite frequently rejects a MATCH constraint that is OR-ed with other conditions ("unable to use function MATCH in the requested context"), combine the two scans with UNION instead:

SELECT * FROM filenames WHERE filename MATCH 'aff'
UNION
SELECT * FROM filenames WHERE filename LIKE '%ир%';

This leverages FTS5's index for efficient 3+ character searches while LIKE handles the short patterns.
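A runnable sketch of the hybrid approach follows, under two assumptions: the default unicode61 tokenizer is used for portability (the article's trigram setup needs SQLite 3.34+), and the two scans are combined with UNION since SQLite frequently rejects MATCH inside an OR:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE filenames_data(id INTEGER PRIMARY KEY, filename TEXT);
-- default unicode61 tokenizer for portability; the article's setup
-- would use tokenize="trigram" on SQLite 3.34+
CREATE VIRTUAL TABLE filenames_fts USING fts5(filename);
""")
rows = [(1, "giraffe.png"), (2, "жираф.png")]
conn.executemany("INSERT INTO filenames_data VALUES (?, ?)", rows)
conn.executemany(
    "INSERT INTO filenames_fts(rowid, filename) VALUES (?, ?)", rows)

# UNION combines an FTS5 index scan with a LIKE scan; OR-ing the
# two predicates in one WHERE clause is often rejected for MATCH
hits = conn.execute("""
    SELECT filename FROM filenames_fts WHERE filenames_fts MATCH 'giraffe'
    UNION
    SELECT filename FROM filenames_data WHERE filename LIKE '%ир%'
""").fetchall()
print(sorted(hits))  # [('giraffe.png',), ('жираф.png',)]
```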

By systematically applying these solutions – prioritizing SQLite upgrades and architectural workarounds – developers can achieve robust short substring matching across all Unicode characters in FTS5 virtual tables.
