Hebrew Text Handling in SQLite: SUBSTR() Appears to Return Byte Counts Instead of Expected Characters
Issue Overview: Mismatch Between Expected and Actual Character Counts in Hebrew UTF-8 Strings
The core issue arises when working with Hebrew text stored in SQLite databases, where functions like SUBSTR() and LENGTH() return values that appear to count bytes rather than "logical characters." This is particularly confusing when dealing with Hebrew vowels (niqqud), cantillation marks, and other combining diacritics, which visually combine with base consonants but are stored as separate Unicode code points.
For example, the Hebrew string 'מִ / לְּבוֹא' contains consonants with multiple diacritic marks. The user expects this string to be 8 "characters" long but observes that LENGTH(hebrew) returns 12. The SUBSTR() function splits the string at code point boundaries rather than logical glyph boundaries, leading to unexpected parsing results. This behavior is not due to SQLite misclassifying the text as a BLOB; it stems from the relationship between Unicode code points, UTF-8 encoding, and the visual representation of Hebrew script.
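The mismatch is easy to reproduce in the sqlite3 CLI. A minimal sketch using the sample string (only built-in functions are used):

SELECT LENGTH('מִ / לְּבוֹא');        -- 12: one count per Unicode code point
SELECT SUBSTR('מִ / לְּבוֹא', 1, 2);  -- 'מִ': code points 1-2 (Mem + Hiriq), one visual glyph
SELECT UNICODE('מִ');                 -- 1502 (U+05DE): code point of the leading Mem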
Key Technical Observations:
- Unicode and UTF-8: Hebrew characters are part of the Unicode standard and are encoded in UTF-8. Every Unicode code point (e.g., consonants, vowel points, cantillation marks) is translated into 1–4 bytes in UTF-8. SQLite treats TEXT as UTF-8 strings and BLOB as raw bytes. The SUBSTR() function operates on UTF-8 characters (code points, not bytes) for TEXT and on bytes for BLOB.
- Combining Characters: Hebrew diacritics (e.g., HIRIQ, DAGESH, SHEVA) are Unicode combining marks. These are stored as separate code points following their base consonant. For example, the consonant ל (Lamed) followed by DAGESH (ּ) and SHEVA (ְ) is three code points: U+05DC, U+05BC, U+05B0.
- Glyph Clustering: A single visual glyph (e.g., לְּ) may consist of multiple code points. SQLite's string functions count individual code points, not glyph clusters. This is why LENGTH('לְּ') returns 3, not 1. A decomposition sketch follows this list.
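The per-code-point view can be inspected directly. A sketch assuming the generate_series table-valued function, which is bundled with the stock sqlite3 CLI:

SELECT value AS pos,
       SUBSTR('לְּ', value, 1) AS chr,
       PRINTF('U+%04X', UNICODE(SUBSTR('לְּ', value, 1))) AS codepoint
FROM generate_series(1, LENGTH('לְּ'));
-- Three rows: U+05DC (Lamed) plus its two combining marks,
-- U+05BC (Dagesh) and U+05B0 (Sheva), in whatever order they were stored.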
Possible Causes: Why SQLite String Functions "Miscount" Hebrew Characters
Misunderstanding Unicode Normalization
Hebrew text containing diacritics is often stored in Unicode Normalization Form D (NFD), where base characters and combining marks are separate. SQLite does not perform automatic normalization, so LENGTH() counts each code point individually. If the text were normalized to Form C (NFC), where applicable, some combining sequences would be represented as single precomposed characters, reducing the code point count. However, many Hebrew diacritics lack precomposed forms in Unicode, making NFC normalization incomplete for Hebrew.

UTF-8 Encoding Ambiguity
While SQLite correctly interprets TEXT as UTF-8, the presence of combining marks inflates the code point count. For example, the string מִ consists of two code points: U+05DE (Mem) and U+05B4 (Hiriq vowel). LENGTH() returns 2, but the user perceives this as a single "character."

Confusion Between Code Points and Grapheme Clusters
A grapheme cluster is a user-perceived character (e.g., לְּ as one unit). SQLite's string functions operate on code points, not grapheme clusters. This discrepancy is not unique to Hebrew; it affects all scripts with combining marks (e.g., Arabic, Devanagari).

Incorrect Assumptions About String Storage
If the Hebrew text was improperly decoded/encoded during insertion (e.g., from ISO-8859-8 to UTF-8), SQLite might interpret it as a BLOB-like string, causing functions to count bytes. However, the CLI's correct display of Hebrew confirms proper UTF-8 handling in this case. The TEXT-versus-BLOB distinction is demonstrated concretely below.
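The same two-code-point string yields different counts depending on its type, which verifies the distinction (a sketch for the sqlite3 CLI, assuming a UTF-8 database encoding):

SELECT LENGTH('מִ');                -- 2: TEXT is measured in code points
SELECT LENGTH(CAST('מִ' AS BLOB));  -- 4: BLOB is measured in raw UTF-8 bytes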
Troubleshooting Steps, Solutions & Fixes: Handling Hebrew Text in SQLite
Step 1: Verify Encoding and Normalization
Check Encoding: Ensure the Hebrew text is stored as UTF-8. Use the HEX() function to inspect byte sequences:
SELECT HEX(hebrew) FROM hebtest;
Valid Hebrew code points (U+0591 through U+05F4) each encode as a two-byte UTF-8 sequence in the range D6 91 (for U+0591) through D7 B4 (for U+05F4). A per-character inspection sketch follows this step.
Normalization: Convert text to NFC if possible. Use external tools (e.g., ICU libraries) since SQLite lacks built-in normalization functions. For Tcl users (assuming an ICU binding that provides ::icu::normalize):
package require icu
set normalized [::icu::normalize $hebrew NFC]
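To see each stored character alongside its UTF-8 bytes, the following sketch (assuming the hebtest table from the question, a UTF-8 database encoding, and the generate_series extension) walks the string code point by code point:

SELECT value AS pos,
       SUBSTR(hebrew, value, 1) AS chr,
       HEX(CAST(SUBSTR(hebrew, value, 1) AS BLOB)) AS utf8_bytes
FROM hebtest, generate_series(1, LENGTH(hebrew));
-- e.g., pos 1 → 'מ' → 'D79E'; pos 2 → 'ִ' → 'D6B4'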
Step 2: Adapt String Operations to Code Point Logic
Count Code Points Explicitly: Use LENGTH() with the understanding that it counts code points. For grapheme cluster counts, iterate over the string and detect base characters followed by combining marks:
WITH split AS (
  SELECT SUBSTR(hebrew, value, 1) AS chr
  FROM hebtest, generate_series(1, LENGTH(hebrew))
)
SELECT SUM(
  CASE
    WHEN UNICODE(chr) BETWEEN 0x0591 AND 0x05F4
         AND UNICODE(chr) NOT BETWEEN 0x05D0 AND 0x05EA THEN 0
    ELSE 1
  END
) AS logical_chars
FROM split;
This query counts base Hebrew consonants (U+05D0–U+05EA) and any non-Hebrew character (spaces, the / separator) as 1, and everything else in the Hebrew block, i.e., the diacritics, as 0. For the sample string it returns 8. Note that the simple range test also treats Hebrew punctuation such as maqaf (U+05BE) as a diacritic.
Substring by Grapheme Clusters: To split strings at logical glyph boundaries, use a lookup table of base characters and their combining marks:
CREATE TABLE hebrew_diacritics (code INTEGER);
INSERT INTO hebrew_diacritics VALUES (0x05B0), (0x05B1), ..., (0x05C7); -- Populate with all diacritic code points
WITH clusters AS (
  SELECT value AS pos,
         SUBSTR(hebrew, value, 1) AS chr,
         CASE
           WHEN UNICODE(SUBSTR(hebrew, value, 1)) BETWEEN 0x05D0 AND 0x05EA THEN 1
           WHEN EXISTS (
             SELECT 1 FROM hebrew_diacritics
             WHERE code = UNICODE(SUBSTR(hebrew, value, 1))
           ) THEN 0
           ELSE 1
         END AS is_base
  FROM hebtest, generate_series(1, LENGTH(hebrew))
)
SELECT GROUP_CONCAT(chr, '') AS cluster
FROM (
  SELECT chr, pos, SUM(is_base) OVER (ORDER BY pos) AS cluster_id
  FROM clusters
)
GROUP BY cluster_id;
This groups base characters with their subsequent diacritics into clusters, one output row per grapheme. GROUP_CONCAT's concatenation order is formally unspecified; on SQLite 3.44+ it can be pinned down with GROUP_CONCAT(chr, '' ORDER BY pos). A hypothetical end-to-end run follows.
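A minimal setup for trying the query, assuming a hebtest table holding the sample string:

CREATE TABLE hebtest (indexRow INTEGER, hebrew TEXT);
INSERT INTO hebtest VALUES (1, 'מִ / לְּבוֹא');
-- The clustering query above should then return eight rows:
-- 'מִ', ' ', '/', ' ', 'לְּ', 'ב', 'וֹ', 'א'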
Step 3: Application-Side Workarounds
- Preprocess Text: Use a scripting language (e.g., Tcl, Python) to count grapheme clusters before inserting data into SQLite. A Tcl sketch (the combining-mark range below is an approximation that also swallows a few Hebrew punctuation marks such as maqaf, U+05BE):
proc count_clusters {str} {
    set clusters 0
    set len [string length $str]
    for {set i 0} {$i < $len} {incr i} {
        incr clusters
        # Absorb any combining marks that follow this base character.
        # Hebrew points and accents occupy roughly U+0591..U+05C7.
        while {$i + 1 < $len} {
            scan [string index $str [expr {$i + 1}]] %c cp
            if {$cp < 0x0591 || $cp > 0x05C7} break
            incr i
        }
    }
    return $clusters
}
puts [count_clusters "מִ / לְּבוֹא"]   ;# prints 8
- Postprocess Query Results: Filter out combining marks after using SUBSTR(), keeping only base letters (note this also drops spaces and separators, since only U+05D0–U+05EA qualify as base letters):
SELECT GROUP_CONCAT(
  CASE WHEN is_base THEN chr ELSE '' END, ''
) AS parsed
FROM (
  SELECT value,
         SUBSTR(hebrew, value, 1) AS chr,
         UNICODE(SUBSTR(hebrew, value, 1)) BETWEEN 0x05D0 AND 0x05EA AS is_base
  FROM hebtest, generate_series(1, LENGTH(hebrew))
  ORDER BY value
);
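Run against the sample string directly, this reduces the text to its consonantal skeleton (a sketch; the ORDER BY in the subquery keeps the scan in string order in practice, though GROUP_CONCAT order is not formally guaranteed):

SELECT GROUP_CONCAT(chr, '') AS consonants
FROM (
  SELECT value, SUBSTR('מִ / לְּבוֹא', value, 1) AS chr
  FROM generate_series(1, LENGTH('מִ / לְּבוֹא'))
  WHERE UNICODE(SUBSTR('מִ / לְּבוֹא', value, 1)) BETWEEN 0x05D0 AND 0x05EA
  ORDER BY value
);
-- expected: 'מלבוא' (the five base letters)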
Step 4: Schema and Query Design Adjustments
- Store Pre-Split Strings: If the parsing logic is static (e.g., splitting at /), pre-split the text into columns during insertion (see the sketch after this list):
CREATE TABLE hebtest (
  indexRow INTEGER,
  hebrew TEXT,
  part1 TEXT,
  part2 TEXT
);
- Use Fixed-Length Columns: For liturgical texts with predictable structures, define columns for base consonants and separate columns for diacritics.
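A hypothetical insertion performing the split at /. Because INSTR() and SUBSTR() both count code points on TEXT values, they agree on positions, and TRIM() removes the spaces around the separator:

INSERT INTO hebtest (indexRow, hebrew, part1, part2)
SELECT 1, h,
       TRIM(SUBSTR(h, 1, INSTR(h, '/') - 1)),
       TRIM(SUBSTR(h, INSTR(h, '/') + 1))
FROM (SELECT 'מִ / לְּבוֹא' AS h);
-- part1 = 'מִ', part2 = 'לְּבוֹא'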
Final Note: SQLite’s behavior is correct but reflects the limitations of generic string functions with complex scripts. Solutions require combining database operations with application logic to account for Hebrew’s unique orthography.