Clarifying Non-Latin Script Handling in SQLite FTS5 and Spellfix1 Modules

Tokenization Mechanics and Script Recognition in FTS5/Spellfix1

The core challenge is understanding how SQLite's Full-Text Search (FTS5) and Spellfix1 modules process non-Latin scripts. The two modules have fundamentally different architectures, and those architectures shape their script support:

FTS5 Unicode61 Tokenizer Behavior
The unicode61 tokenizer uses Unicode 6.1.0 character categorization rules to split text into tokens. It performs three critical operations:

  1. Diacritic Removal: Strips accents from characters (é → e)
  2. Case Folding: Converts to lowercase (A → a)
  3. Boundary Detection: Identifies token boundaries using:
    • Unicode "separator" characters (whitespace)
    • Defined punctuation characters
    • Changes between letters/non-letters

This works adequately for scripts that use explicit word separators (Latin, Cyrillic, Greek). Scripts written without spaces between words, notably Chinese and Japanese, require specialized segmentation: unicode61 treats each contiguous run of CJK characters as a single token unless a custom tokenizer splits it, as the quick demonstration below shows.
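
A quick demonstration of this behavior using only built-in FTS5 features (table and column names are illustrative):

CREATE VIRTUAL TABLE seg_demo USING fts5(body, tokenize = 'unicode61');

INSERT INTO seg_demo VALUES ('full text search engine');  -- four space-delimited tokens
INSERT INTO seg_demo VALUES ('全文検索エンジン');            -- one unsegmented token

SELECT count(*) FROM seg_demo WHERE seg_demo MATCH 'search';  -- 1: the English word is findable
SELECT count(*) FROM seg_demo WHERE seg_demo MATCH '検索';    -- 0: the CJK sub-word was never split out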

Spellfix1 Script Detection vs. Operational Support
The spellfix1 module performs script detection through its spellfix1_scriptcode() SQL function. Internally, each character is mapped onto a script bitmask, and the function reports a numeric script code for the whole string (bitmask / code shown where documented):

  • Latin (0x0001 / 215)
  • Cyrillic (0x0002 / 220)
  • Greek (0x0004 / 200)
  • Hebrew (0x0008)
  • Arabic (0x0010)
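
These codes can be probed directly once the spellfix extension is loaded; the exact value returned for mixed-script input is not documented here, so it is only hinted at in the comment:

-- requires the spellfix1 extension to be loaded into the connection
SELECT spellfix1_scriptcode('hello');          -- 215 (Latin)
SELECT spellfix1_scriptcode('привет');         -- 220 (Cyrillic)
SELECT spellfix1_scriptcode('hello привет');   -- a distinct value signalling mixed scripts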

Script detection ≠ full operational support. While spellfix1 can identify these scripts, its Levenshtein distance algorithm and phonetic matching are optimized for Latin-derived orthographies. The module’s behavior changes when using the langid parameter:

  • langid=0 (default): words inserted without an explicit langid land in vocabulary 0, and queries that do not constrain langid search only that vocabulary, which in practice privileges ASCII/Latin content
  • langid=N: stores and searches a separate vocabulary, letting the application partition words by script family (see the sketch below)
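
A minimal sketch of that partitioning (table and words are illustrative):

CREATE VIRTUAL TABLE langdemo USING spellfix1;
INSERT INTO langdemo(word) VALUES ('kitten');              -- no langid given: stored under langid=0
INSERT INTO langdemo(word, langid) VALUES ('котёнок', 2);  -- stored in a separate langid=2 vocabulary

SELECT word FROM langdemo WHERE word MATCH 'kiten';                  -- searches langid=0 only
SELECT word FROM langdemo WHERE word MATCH 'котенок' AND langid=2;   -- langid must be named explicitly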

Documentation Gaps
Key undocumented behaviors include:

  1. FTS5’s inability to split combined Hangul syllables in Korean
  2. Spellfix1’s prioritization of Latin script in mixed-input scenarios
  3. Lack of normalization for right-to-left scripts in spellfix1

Architectural Constraints Impacting Script Processing

Unicode61 Tokenizer Limitations
The tokenizer’s design causes three critical issues with non-space-delimited scripts:

  1. CJK Token Bloat: Entire sentences become single tokens unless manually segmented
  2. Combining Character Issues: Arabic and Hebrew vowel marks may be stripped as diacritics
  3. Normalization Gaps: Input is never normalized to NFC or NFD, so canonically equivalent strings can be indexed as different terms (probed in the sketch below)
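
A minimal probe of this gap: the two inserted strings are canonically equivalent under Unicode normalization, yet unicode61 never normalizes its input, so the fts5vocab table reveals whether they were indexed as the same term:

CREATE VIRTUAL TABLE norm_demo USING fts5(t, tokenize = 'unicode61 remove_diacritics 0');
CREATE VIRTUAL TABLE norm_vocab USING fts5vocab(norm_demo, 'row');

INSERT INTO norm_demo VALUES (char(0x00E9) || 'cole');         -- U+00E9, precomposed é (NFC form)
INSERT INTO norm_demo VALUES ('e' || char(0x0301) || 'cole');  -- e + U+0301 combining acute (NFD form)

-- A normalizing tokenizer would yield a single term here; multiple distinct
-- terms confirm that canonically equivalent input is indexed inconsistently
SELECT term, doc FROM norm_vocab ORDER BY term;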

Spellfix1 Algorithmic Biases
The module’s design contains Latin-centric assumptions:

  1. Phonetic Matching: hashes are computed from an ASCII transliteration of each word, so they approximate English-like pronunciation only (see the probe after this list)
  2. Edit Distance Costs: Optimized for small character sets (26 letters + diacritics)
  3. Case Conversion: Implicit case folding breaks scripts with contextual case rules
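
A quick probe of that Latin funnel, assuming the spellfix1_translit() and spellfix1_phonehash() scalar functions the spellfix extension registers alongside the virtual table:

SELECT spellfix1_translit('café');      -- collapses to an ASCII approximation ('cafe')
SELECT spellfix1_translit('Москва');    -- Cyrillic is likewise forced into an ASCII rendering
SELECT spellfix1_phonehash(spellfix1_translit('Москва'));  -- the phonetic hash only ever sees ASCII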

Code vs Documentation Mismatches
The spellfix1 source code reveals capabilities absent from documentation:

/* Supported script masks from spellfix1.c */
#define SCRIPT_LATIN    0x0001
#define SCRIPT_CYRILLIC 0x0002
#define SCRIPT_GREEK    0x0004
#define SCRIPT_HEBREW   0x0008
#define SCRIPT_ARABIC   0x0010

This indicates potential support for Semitic scripts, but real-world testing shows:

  • No handling of Arabic letter ligatures
  • Hebrew word-final forms treated as distinct characters
  • Missing support for script-specific error patterns

Resource Allocation Patterns
Both modules make sizing choices that affect script handling:

  1. FTS5's index structures assume short, space-delimited tokens; unsegmented CJK runs become abnormally long terms that inflate the index without improving recall
  2. Spellfix1 caps the number of candidates considered per query (via the top query parameter), a limit that takes no account of script density variations

Validation Protocols and Configuration Strategies

FTS5 Tokenization Verification
Execute comprehensive tokenization audits using:

-- Create test table
CREATE VIRTUAL TABLE script_test USING fts5(text, tokenize=unicode61);

-- Insert sample data
INSERT INTO script_test VALUES ('русский текст');  -- Russian
INSERT INTO script_test VALUES ('中文测试');       -- Chinese

-- Probe matching behaviour: the Russian word matches because it was indexed as
-- its own token; the Chinese sub-word query finds nothing because '中文测试'
-- was indexed as a single token
SELECT * FROM script_test WHERE script_test MATCH 'русский';
SELECT * FROM script_test WHERE script_test MATCH '测试';

Validate actual token boundaries using the fts5vocab virtual table:

-- Create vocabulary table
CREATE VIRTUAL TABLE temp.terms USING fts5vocab(script_test, 'row');

-- Retrieve stored tokens
SELECT term FROM terms ORDER BY term;

Expected Tokenization Results

  Script   | Input            | Tokenization Result    | Valid for word search?
  ---------|------------------|------------------------|------------------------
  Russian  | "русский текст"  | ["русский", "текст"]   | Yes
  Chinese  | "中文测试"        | ["中文测试"]            | No (single token)
  Arabic   | "اللغة العربية"   | ["اللغة", "العربية"]    | Yes (space-delimited)
  Japanese | "日本語のテスト"   | ["日本語のテスト"]       | No (single token)

Spellfix1 Script Isolation
Force script-specific handling using langid:

-- Create table and partition the vocabulary by application-chosen langid values
CREATE VIRTUAL TABLE fx USING spellfix1;
INSERT INTO fx(word, langid) VALUES ('東京', 1);  -- langid=1: reserved here for Japanese entries
INSERT INTO fx(word, langid) VALUES ('tokyo', 0); -- langid=0: the default (Latin) vocabulary

-- Query with script restriction (langid=0 is also the implicit default)
SELECT word FROM fx WHERE word MATCH 'toukyou' AND langid=0;

Script-Specific Configuration Tuning
Adjust spellfix1 parameters per script requirements:

-- Custom weight table in the layout expected by editdist3
CREATE TABLE cyrillic_weights(
  iLang INT,   -- langid the rule applies to
  cFrom TEXT,  -- character(s) replaced ('' denotes an insertion rule)
  cTo TEXT,    -- replacement character(s) ('' denotes a deletion rule)
  iCost INT    -- cost of the edit
);
INSERT INTO cyrillic_weights VALUES
  (220, 'ш', 'щ', 20),  -- cheap ш→щ substitution
  (220, 'ъ', '', 100);  -- explicit cost for deleting ъ

-- Cyrillic configuration: attach the weights with the edit_cost_table option
-- at table-creation time (custom costs cannot be set through a later INSERT)
CREATE VIRTUAL TABLE fx_ru USING spellfix1(edit_cost_table=cyrillic_weights);
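
Once fx_ru is populated with vocabulary, a fuzzy lookup against the Cyrillic partition might look like the sketch below; match quality still depends on the transliteration caveats discussed earlier:

-- Rank candidates by the custom-weighted edit distance
SELECT word, distance FROM fx_ru
 WHERE word MATCH 'шапка' AND langid = 220
 ORDER BY distance
 LIMIT 5;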

Custom Tokenizer Integration for FTS5
Implement ICU-based tokenization for CJK:

  1. Compile SQLite (or a loadable extension) against ICU; note that the bundled ICU tokenizer targets FTS3/4, so FTS5 needs a thin wrapper
  2. Create the custom tokenizer (icuCreate/icuDelete/icuTokenize are the wrapper's own functions):
fts5_tokenizer icu_tokenizer = {
  .xCreate   = icuCreate,    /* parse locale arguments, open an ICU word break iterator */
  .xDelete   = icuDelete,
  .xTokenize = icuTokenize   /* emit one xToken() callback per ICU-detected word boundary */
};
  3. Register the tokenizer through the fts5_api object, then reference it by name:
/* pApi obtained via the fts5_api_from_db() pattern in the FTS5 documentation */
pApi->xCreateTokenizer(pApi, "icu", 0, &icu_tokenizer, 0);

CREATE VIRTUAL TABLE jp_text USING fts5(text, tokenize = 'icu loc=ja');

Script-Aware Spellfix1 Extensions
Enhance spellfix1 with script-specific normalization ahead of its transliteration and edit-distance steps. The sketch below is hypothetical: handle_arabic_ligatures() and normalize_hebrew_finals() are placeholder helpers, not part of spellfix1:

/* Hypothetical pre-processing hook applied before spellfix1's own transliteration */
void script_aware_translit(int script, const char *zIn, char *zOut){
  switch( script ){
    case SCRIPT_ARABIC:
      handle_arabic_ligatures(zIn, zOut);   /* fold presentation-form ligatures back to base letters */
      break;
    case SCRIPT_HEBREW:
      normalize_hebrew_finals(zIn, zOut);   /* map word-final forms (ך ם ן ף ץ) to their base forms */
      break;
  }
}

Diagnostic Queries for Script Support
Execute system table inspections:

-- Inspect FTS5 tokenizer parameters (recorded in the CREATE VIRTUAL TABLE statement)
SELECT sql FROM sqlite_master WHERE name = 'script_test';

-- Inspect the spellfix1 vocabulary: data lives in the <table>_vocab shadow table
PRAGMA table_info(fx_vocab);
SELECT DISTINCT langid FROM fx_vocab;

Performance Optimization Techniques

  1. Script-Specific FTS5 Contentless Tables (a content column must still be declared):
CREATE VIRTUAL TABLE ru_fts USING fts5(
  body,
  content='',
  content_rowid='id',
  tokenize = 'unicode61 remove_diacritics 0'
);
  2. Spellfix1 Search Breadth: widen the candidate search for dense vocabularies with the scope query parameter (at extra CPU cost):
-- Larger scope values gather more candidate rows before ranking
SELECT word FROM fx WHERE word MATCH 'tokyo' AND scope=6;

Third-Party Integration Paths

  1. MeCab Integration for Japanese:
git clone https://github.com/ghostlang-extension/sqlite-mecab
make && sudo make install
SELECT mecab_split('日本語のテスト');  -- e.g. ["日本", "語", "の", "テスト"]
  2. libthai for Thai Script (registering a Thai-aware module from C):
sqlite3_create_module_v2(db, "thai_fts5", &thai_fts5_module, 0, 0);

Cross-Module Script Coordination
Implement unified script handling across FTS5 and spellfix1:

-- Shared script registry table
CREATE TABLE script_registry (
  langid INTEGER PRIMARY KEY,
  script_mask INTEGER,
  fts5_tokenizer TEXT,
  spellfix_weights TEXT
);

INSERT INTO script_registry VALUES
  (215, 0x0001, 'unicode61', 'latin_weights'),
  (220, 0x0002, 'unicode61 remove_diacritics 0', 'cyrillic_weights');
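
One way to consume the registry is as a consistency check. The sketch below assumes the fx spellfix1 table from earlier, whose vocabulary lives in the fx_vocab shadow table:

-- Flag any langid stored in the spellfix1 vocabulary but missing from the registry
SELECT v.langid
  FROM (SELECT DISTINCT langid FROM fx_vocab) AS v
  LEFT JOIN script_registry AS r ON r.langid = v.langid
 WHERE r.langid IS NULL;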

Automated Validation Framework
Develop comprehensive test scripts:

import sqlite3
import unittest

class ScriptValidation(unittest.TestCase):
    def setUp(self):
        self.conn = sqlite3.connect(':memory:')
        self.conn.enable_load_extension(True)
        # './icu' is a placeholder loadable extension that registers the
        # custom FTS5 ICU tokenizer sketched earlier
        self.conn.load_extension('./icu')

    def tearDown(self):
        self.conn.close()

    def test_japanese_tokenization(self):
        self.conn.execute("CREATE VIRTUAL TABLE test USING fts5(text, tokenize='icu loc=ja')")
        self.conn.execute("INSERT INTO test VALUES ('日本語のテスト')")
        # A segmenting tokenizer must make the sub-word query match the inserted row
        cur = self.conn.execute("SELECT * FROM test WHERE test MATCH '日本'")
        self.assertGreater(len(cur.fetchall()), 0)

Documentation Supplementation
Create script support matrices:

  Script   | FTS5 (unicode61) | FTS5 (ICU) | Spellfix1 (langid=0) | Spellfix1 (langid=N)
  ---------|------------------|------------|----------------------|---------------------
  Latin    | Full             | Full       | Optimized            | Full
  Cyrillic | Partial¹         | Full       | Basic                | Improved
  CJK      | None²            | Full       | None                 | Custom³
  Arabic   | Partial⁴         | Full       | Basic                | Custom³

¹ Diacritic removal issues possible
² Requires manual segmentation
³ Needs custom edit cost tables
⁴ Ligature handling limitations

Production Deployment Checklist

  1. Audit all input scripts using Unicode range detectors or spellfix1_scriptcode() (see the sketch after this list)
  2. Implement fallback tokenizers for unhandled scripts
  3. Configure spellfix1 langid isolation per content type
  4. Benchmark memory usage with script-specific data loads
  5. Establish continuous integration with script validation tests
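
A sketch of item 1 using spellfix1_scriptcode(); the documents table and body column are placeholders for the application's own content:

-- Count rows per detected script before choosing tokenizer and spellfix1 settings
SELECT spellfix1_scriptcode(body) AS script_code, count(*) AS n
  FROM documents
 GROUP BY script_code
 ORDER BY n DESC;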

This comprehensive approach enables systematic handling of non-Latin scripts across SQLite’s text search modules while accounting for their inherent architectural constraints.
