FTS3 Expression Test Failure Due to ICU Tokenization Changes in SQLite
Issue Overview: FTS3 Expression Test Failure on 32-bit Systems with ICU 72.1
The core issue is a test failure in the SQLite 3.40.0 test suite, specifically in the fts3expr4-1.8 test case, when executed on a 32-bit Linux system. The failure occurs during multilib builds (32-bit SQLite on a 64-bit host) and manifests as a discrepancy in tokenization behavior when using the ICU library (International Components for Unicode). The test expects the input string d:word to be parsed as a single phrase, [PHRASE 3 0 d:word], but instead receives a nested AND structure, [AND {AND {PHRASE 3 0 d} {PHRASE 3 0 :}} {PHRASE 3 0 word}]. This divergence stems from changes in ICU’s word-breaking algorithm across versions and locale configurations, which directly affect SQLite’s FTS3/4 (Full-Text Search) tokenization logic.
The failure is not a universal SQLite defect but a localized incompatibility between SQLite’s test expectations and ICU’s evolving tokenization rules. The problem is exacerbated by environmental factors such as the ICU library version, system locale settings, and SQLite’s reliance on ICU for locale-aware tokenization. The test case assumes a specific tokenization output that is no longer guaranteed under newer ICU versions or certain locales, leading to spurious test failures during packaging or deployment.
Possible Causes: ICU Version, Locale Settings, and Tokenization Rule Changes
1. ICU Library Version Differences (71.1 vs. 72.1)
ICU’s word-breaking rules and algorithms are version-dependent. The test failure correlates with ICU 72.1, where the tokenization of colons (:) in identifiers like d:word changed compared to ICU 71.1. ICU 72.1 introduced updates to Unicode standards compliance (e.g., Unicode 15.0 and CLDR 42.0), which altered how certain punctuation marks are treated in word segmentation; notably, Unicode 15.0 removed the colon (U+003A) from the MidLetter word-break class in UAX #29. In locales like en_US.UTF-8, ICU 71.1 treated hello:world as a single token, whereas ICU 72.1 splits it into hello, :, and world. This behavioral shift directly impacts SQLite’s FTS3/4 tokenizer when ICU is enabled.
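The version dependence described above can be captured in a small helper. This is a minimal sketch, assuming the threshold of 72 reported in this article holds on your platform; the function names icu_major and colon_splits are hypothetical and not part of ICU or SQLite.

```shell
# Extract the major component from an ICU version string such as "72.1".
icu_major() (
  printf '%s\n' "${1%%.*}"
)

# Assumption taken from this article: ICU 72 and later split "d:word",
# earlier releases keep it as one token. Prints "split" or "single".
colon_splits() (
  major=$(icu_major "$1")
  if [ "$major" -ge 72 ]; then
    echo split
  else
    echo single
  fi
)

# Demonstration on the two versions discussed here.
for v in 71.1 72.1; do
  printf 'ICU %s -> %s\n' "$v" "$(colon_splits "$v")"
done
```

In practice the version string would come from a tool such as icuinfo or pkg-config --modversion icu-uc rather than being hard-coded.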
2. Locale-Specific Tokenization Behavior
The system’s locale configuration (e.g., en_GB.UTF-8, C.UTF-8) influences ICU’s tokenization. For example:
- en_US.UTF-8 with ICU 71.1: Treats hello:world as a single token due to locale-specific rules considering colons as part of word characters in certain contexts.
- C.UTF-8 or ICU 72.1: Splits hello:world into three tokens, interpreting the colon as a standalone delimiter. This is because the "C" locale (POSIX) has minimal linguistic rules, causing ICU to default to stricter word boundaries.
SQLite’s test suite may not account for these locale-induced variations, leading to test failures when the build environment’s locale differs from the test’s assumptions.
3. Environment Variables and Build Configuration
Environment variables like LANG, LC_ALL, or LC_CTYPE override system-wide locale settings. If these variables are set to a locale that triggers divergent tokenization (e.g., C.UTF-8 instead of en_US.UTF-8), ICU’s output changes, breaking the test’s expectations. Additionally, SQLite’s build configuration (e.g., --enable-fts3, --enable-icu, or the SQLITE_ENABLE_ICU compile-time option) determines whether ICU is used for tokenization. A mismatch between the build’s ICU support and the test’s assumptions can cause failures.
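When several of these variables are set at once, it helps to know which one wins. The sketch below applies the standard POSIX precedence for a single category (LC_ALL, then the category variable, then LANG). The function name effective_ctype is hypothetical, and whether ICU itself consults LC_CTYPE rather than another category is not established here.

```shell
# Resolve the locale that applies to LC_CTYPE under POSIX precedence rules.
# Values are passed as arguments so the function can be exercised without
# modifying the caller's environment.
effective_ctype() (
  lc_all=$1
  lc_ctype=$2
  lang=$3
  if [ -n "$lc_all" ]; then
    printf '%s\n' "$lc_all"        # LC_ALL overrides everything
  elif [ -n "$lc_ctype" ]; then
    printf '%s\n' "$lc_ctype"      # then the category-specific variable
  elif [ -n "$lang" ]; then
    printf '%s\n' "$lang"          # then the LANG fallback
  else
    printf 'C\n'                   # commonly the default when nothing is set
  fi
)

# Report the value for the current environment.
effective_ctype "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}"
```

Running this in both the build environment and an interactive shell is a quick way to spot a packaging tool that silently exports a different locale.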
Troubleshooting Steps, Solutions & Fixes
Step 1: Confirm ICU Version and Locale Configuration
Check ICU Version:
Run icuinfo | grep version to verify the ICU library version. Compare it against the versions discussed here (e.g., 71.1 vs. 72.1). If the system uses ICU 72.1, note that its tokenization rules may differ from earlier versions.
Inspect Locale Settings:
Execute locale and env | grep -E 'LANG|LC_' to identify active locale settings. Tests may fail under C.UTF-8 but pass under en_US.UTF-8 due to differing tokenization rules.
Test Tokenization Directly:
Compile and run the diagnostic C program provided in the forum thread to observe how ICU tokenizes strings like hello:world and hello,world. List the library after the source file so the linker can resolve its symbols:
    gcc icu_test.c -o icu_test -licuuc && ./icu_test "d:word"
Compare the output across locales and ICU versions. For example:
- ICU 71.1 + en_US.UTF-8: [d:word] as one token.
- ICU 72.1 + C.UTF-8: [d], [:], [word] as separate tokens.
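The pattern 'LANG|LC_' used in Step 1 also matches variables that merely contain those letters in their value. A slightly stricter filter, sketched here as a hypothetical helper named locale_vars, anchors the match to the variable name.

```shell
# Keep only locale-related assignments from `env`-style input on stdin.
# The match is anchored to the variable name, so a line such as
# EDITOR=LANGedit is not reported.
locale_vars() {
  grep -E '^(LANG|LANGUAGE|LC_[A-Z]+)=' || true
}

# Demonstration on fixed input; in practice use: env | locale_vars
printf '%s\n' 'PATH=/usr/bin' 'LANG=C.UTF-8' 'LC_ALL=en_US.UTF-8' 'EDITOR=LANGedit' | locale_vars
```

The trailing || true keeps the helper from returning a failure status when no locale variable is set, which matters in CI scripts that abort on the first error.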
Step 2: Adjust SQLite Test Expectations or Environment
Modify the Test Case:
The SQLite team resolved this by updating the fts3expr4.test script to accept both tokenization outcomes. If you’re maintaining a patched SQLite build, apply a similar change:
    # Original test expectation
    do_test fts3expr4-1.8 {
      fts3_parse_terms "d:word"
    } {PHRASE 3 0 d:word}

    # Revised test accepting ICU 72.1 output
    do_test fts3expr4-1.8 {
      set result [fts3_parse_terms "d:word"]
      # Brace-quote the alternative so it is compared without extra outer braces
      if {$result eq {AND {AND {PHRASE 3 0 d} {PHRASE 3 0 :}} {PHRASE 3 0 word}}} {
        set result "PHRASE 3 0 d:word"
      }
      set result
    } {PHRASE 3 0 d:word}
This allows the test to pass regardless of ICU’s tokenization behavior.
Override Locale for Tests:
Force the test environment to use a locale compatible with the expected tokenization. For example, set LC_ALL=en_US.UTF-8 before running tests:
    export LC_ALL=en_US.UTF-8
    make test
If the system lacks en_US.UTF-8, generate it via sudo locale-gen en_US.UTF-8 (on Debian or Ubuntu).
Downgrade ICU (Temporary Workaround):
If test flexibility is not feasible, downgrade ICU to a version with compatible tokenization (e.g., 71.1). Use a chroot or container to avoid destabilizing the host system:
    # Example using debootstrap for a minimal environment with an older ICU.
    # Note: Debian bullseye packages ICU 67.1 rather than 71.1; it still predates the 72.1 change.
    sudo debootstrap bullseye /icu-chroot http://deb.debian.org/debian
    sudo chroot /icu-chroot apt-get install -y build-essential tcl-dev libicu-dev
    # Copy the SQLite source tree to /icu-chroot/sqlite-build before this step
    sudo chroot /icu-chroot bash -c "cd /sqlite-build && ./configure && make test"
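Before exporting LC_ALL as in the locale override above, it can be worth confirming that the locale is actually installed, since an unknown name tends to fall back silently to the "C" locale. The following is a minimal sketch; normalize_locale and has_locale are hypothetical helper names, and the normalization assumes the glibc convention in which locale -a prints en_US.utf8 for en_US.UTF-8.

```shell
# Normalize the codeset the way glibc's `locale -a` prints it,
# e.g. "en_US.UTF-8" -> "en_US.utf8". Names without a codeset pass through.
normalize_locale() (
  name=$1
  case $name in
    *.*)
      lang=${name%%.*}
      codeset=$(printf '%s' "${name#*.}" | tr 'A-Z' 'a-z' | sed 's/-//g')
      printf '%s.%s\n' "$lang" "$codeset"
      ;;
    *)
      printf '%s\n' "$name"
      ;;
  esac
)

# Succeed if the locale named in $1 appears in the list read from stdin.
has_locale() (
  want=$(normalize_locale "$1")
  while IFS= read -r have; do
    [ "$(normalize_locale "$have")" = "$want" ] && exit 0
  done
  exit 1
)

# Demonstration on a fixed list; in practice use: locale -a | has_locale en_US.UTF-8
if printf '%s\n' C C.utf8 en_US.utf8 POSIX | has_locale en_US.UTF-8; then
  echo "en_US.UTF-8 is available"
fi
```

A CI job could run this check first and generate the locale only when it is missing.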
Step 3: Update SQLite or ICU Configuration
Rebuild SQLite with ICU Disabled:
If ICU is not required for your use case, disable it during configuration:
    ./configure --disable-icu
    make clean && make
This falls back to SQLite’s built-in tokenizers, which may behave more predictably for FTS3/4 tests. If your configure script has no ICU switch, leave -DSQLITE_ENABLE_ICU out of the compiler flags instead.
Link Against a Specific ICU Version:
Make the dynamic loader pick up a compatible ICU version (e.g., 71.1) via LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=/opt/icu71/lib:$LD_LIBRARY_PATH
    ./configure && make test
LD_LIBRARY_PATH only controls which shared library is loaded at run time; the build must also compile and link against the matching headers and libraries (for example through CPPFLAGS and LDFLAGS).
Patch ICU Tokenization Rules:
For advanced users, customize the word-breaking rules that back ICU’s UBRK_WORD break iterator. This requires recompiling ICU with tailored rules, which is complex but allows fine-grained control over tokenization.
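After changing LD_LIBRARY_PATH or relinking, it is worth confirming which ICU release a binary actually resolves to. The sketch below parses ldd-style output and reads the major version from the libicuuc soname. The helper name linked_icu_major and the sample paths are illustrative assumptions; a statically linked ICU would produce no output.

```shell
# Print the ICU major version(s) found in `ldd`-style output on stdin,
# based on the soname suffix of libicuuc (e.g. libicuuc.so.72 -> 72).
linked_icu_major() {
  sed -n 's/.*libicuuc\.so\.\([0-9][0-9]*\).*/\1/p' | sort -u
}

# Demonstration on a fixed sample line; in practice use something like:
#   ldd ./sqlite3 | linked_icu_major
printf '\tlibicuuc.so.72 => /usr/lib/i386-linux-gnu/libicuuc.so.72 (0xf7a00000)\n' | linked_icu_major
```

If more than one number is printed, the binary pulls in several ICU versions through different dependencies, which is itself a likely source of inconsistent behavior.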
Step 4: Continuous Integration (CI) and Cross-Platform Testing
Add ICU Version Detection to CI:
Enhance CI scripts to detect ICU versions and skip incompatible tests or apply patches dynamically. For example:
    # icu-config may be absent on newer distributions; pkg-config --modversion icu-uc is an alternative
    ICU_VERSION=$(icu-config --version | cut -d '.' -f 1)
    if [[ $ICU_VERSION -ge 72 ]]; then
      patch -p1 < sqlite_icu72_testfix.patch
    fi
Test Across Multiple Locales:
Expand CI pipelines to test under en_US.UTF-8, C.UTF-8, and other relevant locales. Use Docker to isolate environments:
    # Note: bullseye packages ICU 67.1; switch to bookworm (ICU 72.1) to exercise the newer behavior
    FROM debian:bullseye
    # The compiler toolchain and Tcl headers are needed for ./configure && make test
    RUN apt-get update && apt-get install -y locales libicu-dev build-essential tcl-dev
    RUN sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && locale-gen
    ENV LC_ALL=en_US.UTF-8
    COPY sqlite-src-3400000 /sqlite
    WORKDIR /sqlite
    RUN ./configure && make test
Report Upstream:
If discrepancies persist, file a detailed bug report with the SQLite team, including ICU version, locale settings, and test outputs. Provide a minimal reproducer using the icu_test program to isolate the issue from SQLite’s codebase.
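The multi-locale testing described in Step 4 can also be run without Docker. The sketch below runs one command per locale and summarizes the outcome. The helper name run_in_locales is hypothetical, and the demonstration uses a stand-in command instead of the real test suite.

```shell
# Run a command once per locale and report PASS/FAIL for each.
# $1 is the command string; the remaining arguments are locale names.
run_in_locales() (
  cmd=$1
  shift
  status=0
  for loc in "$@"; do
    # LC_ALL is set only for the child process, not for the caller.
    if LC_ALL=$loc sh -c "$cmd" >/dev/null 2>&1; then
      printf 'PASS %s\n' "$loc"
    else
      printf 'FAIL %s\n' "$loc"
      status=1
    fi
  done
  exit "$status"
)

# Demonstration with a stand-in command; replace it with "make test".
run_in_locales 'test "$LC_ALL" != C.UTF-8' en_US.UTF-8 C.UTF-8 || true
```

The function returns a non-zero status if any locale fails, so a CI step can use it directly as a gate.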
Final Solution: Update SQLite Test Suite
The definitive fix, as implemented by the SQLite team, is to modify the test to accept both tokenization outcomes. This acknowledges that ICU’s behavior is locale- and version-dependent, and the test’s intent—to verify that the colon is passed to the tokenizer—is satisfied regardless of how ICU segments the input. Apply this patch or incorporate it into custom builds:
--- a/test/fts3expr4.test
+++ b/test/fts3expr4.test
@@ -XXX,XXX +XXX,XXX @@
   fts3_parse_terms "NEAR(d term, 2)"
 } {NEAR(2 {d term})}
+# Accept both ICU 71.1 and 72.1 tokenization outputs
 do_test fts3expr4-1.8 {
-  fts3_parse_terms "d:word"
-} {PHRASE 3 0 d:word}
+  set res [fts3_parse_terms "d:word"]
+  if {$res eq {AND {AND {PHRASE 3 0 d} {PHRASE 3 0 :}} {PHRASE 3 0 word}}} {
+    set res "PHRASE 3 0 d:word"
+  }
+  set res
+} {PHRASE 3 0 d:word}
 do_test fts3expr4-2.1 {
   fts3_parse_terms "a OR b"
This approach ensures robustness against future ICU changes and locale variability, aligning SQLite’s test suite with real-world usage scenarios.