Hebrew Text Handling in SQLite: Why SUBSTR() Seems to Return Byte Counts Instead of Expected Characters


Issue Overview: Mismatch Between Expected and Actual Character Counts in Hebrew UTF-8 Strings

The core issue arises when working with Hebrew text stored in SQLite databases, where functions like SUBSTR() and LENGTH() return values that appear to count bytes rather than "logical characters." This is particularly confusing when dealing with Hebrew vowels (niqqud), cantillation marks, and other combining diacritics, which visually combine with base consonants but are stored as separate Unicode code points.

For example, the Hebrew string 'מִ / לְּבוֹא' contains consonants with multiple diacritic marks. The user expects this string to be 8 "characters" long but observes that LENGTH(hebrew) returns 12. Likewise, SUBSTR() splits the string at code point boundaries rather than at logical glyph boundaries, leading to unexpected parsing results. This behavior is not due to SQLite misclassifying the text as a BLOB; it stems from the relationship between Unicode code points, UTF-8 encoding, and the visual representation of Hebrew script.
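
The mismatch is easy to reproduce in the sqlite3 shell. A minimal sketch, assuming the sample string is entered exactly as the 12 code points described above and the terminal uses UTF-8:

    SELECT LENGTH('מִ / לְּבוֹא');                 -- 12: one per Unicode code point
    SELECT LENGTH(CAST('מִ / לְּבוֹא' AS BLOB));   -- 21: UTF-8 bytes (2 per Hebrew code point, 1 per ASCII)
    SELECT SUBSTR('מִ / לְּבוֹא', 1, 2);           -- 'מִ': Mem + Hiriq, two code points, one visible glyph

None of these counts equals the 8 glyphs a reader perceives.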

Key Technical Observations:

  1. Unicode and UTF-8: Hebrew characters are part of the Unicode standard and are encoded in UTF-8. Every Unicode code point (e.g., consonants, vowel points, cantillation marks) is translated into 1–4 bytes in UTF-8. SQLite treats TEXT as UTF-8 strings and BLOB as raw bytes. The SUBSTR() function operates on UTF-8 characters (not bytes) for TEXT and bytes for BLOB.
  2. Combining Characters: Hebrew diacritics (e.g., HIRIQ, DAGESH, SHEVA) are Unicode combining marks. These are stored as separate code points following their base consonant. For example, the consonant ל (Lamed) followed by DAGESH (ּ) and SHEVA (ְ) is three code points: U+05DC, U+05BC, U+05B0.
  3. Glyph Clustering: A single visual glyph (e.g., לְּ) may consist of multiple code points. SQLite’s string functions count individual code points, not glyph clusters. This is why LENGTH('לְּ') returns 3, not 1.
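
To see the individual code points behind a single glyph, walk the string with UNICODE(); a minimal sketch using the generate_series table-valued function (available in the standard sqlite3 shell and used throughout this guide):

    SELECT value AS pos,
           PRINTF('U+%04X', UNICODE(SUBSTR('לְּ', value, 1))) AS code_point
    FROM generate_series(1, LENGTH('לְּ'));
    -- Expected (the order of the marks may vary with how the text was typed):
    -- 1 | U+05DC   (Lamed)
    -- 2 | U+05BC   (Dagesh)
    -- 3 | U+05B0   (Sheva)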

Possible Causes: Why SQLite String Functions "Miscount" Hebrew Characters

  1. Misunderstanding Unicode Normalization
    Hebrew text containing diacritics is often stored in Unicode Normalization Form D (NFD), where base characters and combining marks are separate. SQLite does not perform automatic normalization, so LENGTH() counts each code point individually. If the text were normalized to Form C (NFC), where applicable, some combining sequences would be represented as single precomposed characters, reducing the code point count. However, many Hebrew diacritics lack precomposed forms in Unicode, making NFC normalization incomplete for Hebrew.

  2. UTF-8 Encoding Ambiguity
    While SQLite correctly interprets TEXT as UTF-8, the presence of combining marks inflates the code point count. For example, the string מִ consists of two code points: U+05DE (Mem) and U+05B4 (Hiriq vowel). LENGTH() returns 2, but the user perceives this as a single "character."

  3. Confusion Between Code Points and Grapheme Clusters
    A grapheme cluster is a user-perceived character (e.g., לְּ as one unit). SQLite’s string functions operate on code points, not grapheme clusters. This discrepancy is not unique to Hebrew; it affects all scripts with combining marks (e.g., Arabic, Devanagari).

  4. Incorrect Assumptions About String Storage
    If the Hebrew text was bound as a BLOB or improperly transcoded during insertion (e.g., treated as ISO-8859-8 rather than UTF-8), SQLite's string functions would operate on raw bytes or on the wrong code points. However, the CLI's correct display of Hebrew confirms proper UTF-8 TEXT handling in this case; a quick check is shown below.
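
A minimal sanity check, assuming the table and column names used elsewhere in this guide (hebtest, hebrew, indexRow):

    SELECT indexRow, TYPEOF(hebrew), HEX(hebrew) FROM hebtest;
    -- TYPEOF should report 'text' for every row ('blob' would explain pure byte counting),
    -- and the hex dump of Hebrew code points should consist of two-byte D6xx/D7xx pairs.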


Troubleshooting Steps, Solutions & Fixes: Handling Hebrew Text in SQLite

Step 1: Verify Encoding and Normalization

  • Check Encoding: Ensure the Hebrew text is stored as UTF-8. Use the HEX() function to inspect byte sequences:

    SELECT HEX(hebrew) FROM hebtest;
    

    Valid UTF-8 Hebrew letters and diacritics should appear as two-byte sequences corresponding to U+0591 through U+05F4 (e.g., D6 91 for U+0591), i.e., pairs with a lead byte of D6 or D7; a quick check is sketched at the end of this step.

  • Normalization: Convert text to NFC if possible. Use external tools (e.g., ICU libraries) since SQLite lacks built-in normalization functions. For Tcl users, one option (the exact package and command names depend on the ICU binding installed):

    package require icu
    set normalized [::icu::normalize $hebrew NFC]
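
  For reference, this is what a hex dump of known-good UTF-8 Hebrew looks like (runnable directly in the sqlite3 shell):

    SELECT HEX('מ');    -- D79E: U+05DE (Mem) encoded as the two bytes 0xD7 0x9E
    SELECT HEX('מִ');    -- D79ED6B4: Mem followed by Hiriq (U+05B4), four bytes in total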
    

Step 2: Adapt String Operations to Code Point Logic

  • Count Code Points Explicitly: Use LENGTH() with the understanding that it counts code points. For grapheme cluster counts, iterate over the string and detect base characters followed by combining marks:

    WITH split AS (
      SELECT SUBSTR(hebrew, value, 1) AS chr
      FROM hebtest, generate_series(1, LENGTH(hebrew))
    )
    SELECT SUM(
      CASE 
        WHEN UNICODE(chr) BETWEEN 0x0591 AND 0x05F4 
          AND UNICODE(chr) NOT BETWEEN 0x05D0 AND 0x05EA THEN 0 
        ELSE 1 
      END
    ) AS logical_chars FROM split;
    

    This query counts base Hebrew consonants (U+05D0–U+05EA) and non-Hebrew characters (spaces, the slash) as 1 and combining marks as 0; on the sample string 'מִ / לְּבוֹא' it returns 8, matching the expected character count. Note that the range 0x0591–0x05F4 also covers Hebrew punctuation such as MAQAF and GERESH; narrow the range if those should count as characters.

  • Substring by Grapheme Clusters: To split strings at logical glyph boundaries, use a lookup table of base characters and their combining marks:

    CREATE TABLE hebrew_diacritics (code INTEGER);
    INSERT INTO hebrew_diacritics VALUES
      (0x05B0), (0x05B1), ..., (0x05C7); -- Populate with all diacritic code points
    
    WITH clusters AS (
      SELECT 
        value AS pos,
        SUBSTR(hebrew, value, 1) AS chr,
        CASE 
          WHEN UNICODE(SUBSTR(hebrew, value, 1)) BETWEEN 0x05D0 AND 0x05EA THEN 1
          WHEN EXISTS (
            SELECT 1 FROM hebrew_diacritics 
            WHERE code = UNICODE(SUBSTR(hebrew, value, 1))
          ) THEN 0
          ELSE 1
        END AS is_base
      FROM hebtest, generate_series(1, LENGTH(hebrew))
    )
    SELECT GROUP_CONCAT(chr, '') 
    FROM (
      SELECT chr, SUM(is_base) OVER (ORDER BY pos) AS cluster_id
      FROM clusters
    )
    GROUP BY cluster_id;
    

    This groups each base character with its following diacritics into a cluster (the running SUM of is_base, ordered by position, acts as the cluster id). On the sample string 'מִ / לְּבוֹא' it returns the eight grapheme clusters the user expects; the spaces and the slash each form a cluster of their own.

Step 3: Application-Side Workarounds

  • Preprocess Text: Use a scripting language (e.g., Tcl, Python) to count grapheme clusters before inserting data into SQLite. For Tcl, a sketch that treats the Hebrew combining-mark ranges as continuations of the preceding base character:
    proc count_clusters {str} {
      set clusters 0
      set len [string length $str]
      set i 0
      while {$i < $len} {
        incr clusters
        incr i
        # Skip Hebrew combining marks following the base character:
        # U+0591-U+05BD, U+05BF, U+05C1, U+05C2, U+05C4, U+05C5, U+05C7
        while {$i < $len} {
          scan [string index $str $i] %c cp
          if {($cp >= 0x0591 && $cp <= 0x05BD) || $cp == 0x05BF
              || $cp == 0x05C1 || $cp == 0x05C2 || $cp == 0x05C4
              || $cp == 0x05C5 || $cp == 0x05C7} {
            incr i
          } else {
            break
          }
        }
      }
      return $clusters
    }
    
  • Strip Diacritics in Query Results: If downstream processing only needs the bare consonantal text, remove the combining marks after using SUBSTR(), reusing the hebrew_diacritics lookup table from Step 2:
    SELECT 
      GROUP_CONCAT(
        CASE 
          WHEN is_mark THEN '' 
          ELSE chr 
        END, 
        ''
      ) AS stripped
    FROM (
      SELECT 
        SUBSTR(hebrew, value, 1) AS chr,
        EXISTS (
          SELECT 1 FROM hebrew_diacritics 
          WHERE code = UNICODE(SUBSTR(hebrew, value, 1))
        ) AS is_mark
      FROM hebtest, generate_series(1, LENGTH(hebrew))
    );
    

Step 4: Schema and Query Design Adjustments

  • Store Pre-Split Strings: If the parsing logic is static (e.g., splitting at /), pre-split the text into columns during insertion (a sketch of the split follows this list):
    CREATE TABLE hebtest (
      indexRow INTEGER,
      hebrew TEXT,
      part1 TEXT,
      part2 TEXT
    );
    
  • Separate Consonant and Diacritic Columns: For liturgical texts with predictable structure, store base consonants and their diacritics in separate columns.
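
  Because INSTR() and SUBSTR() operate on code points for TEXT, splitting at an ASCII delimiter such as '/' works reliably even when diacritics are present. A minimal sketch of the pre-split insertion, using the sample string from above:

    INSERT INTO hebtest (indexRow, hebrew, part1, part2)
    SELECT 1, h,
           TRIM(SUBSTR(h, 1, INSTR(h, '/') - 1)),  -- text before the slash
           TRIM(SUBSTR(h, INSTR(h, '/') + 1))      -- text after the slash
    FROM (SELECT 'מִ / לְּבוֹא' AS h);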

Final Note: SQLite’s behavior is correct but reflects the limitations of generic string functions with complex scripts. Solutions require combining database operations with application logic to account for Hebrew’s unique orthography.
