Resolving UTF-8 Decoding and Encoding Errors in SQLite regexp_bytecode

Issue Overview: Incorrect UTF-8 Character Handling in regexp_bytecode

The regexp_bytecode function in SQLite’s regexp.c extension exhibited two critical UTF-8 processing errors in specific edge cases. These errors impacted both decoding (reading UTF-8 sequences) and encoding (writing UTF-8 sequences) operations, leading to incorrect results when handling Unicode characters.

UTF-8 Decoding Error for 4-Byte Characters at String End

When decoding a Unicode code point in the range U+10000 to U+10FFFF (which requires a 4-byte UTF-8 encoding) located at the end of an input string, the re_next_char function in regexp.c generated replacement characters (U+FFFD) instead of the correct code point. For example, the Unicode character U+1F4A9 (💩, "Pile of Poo") is represented in UTF-8 as F0 9F 92 A9. When passed to regexp_bytecode as a standalone string, SQLite 3.40.0 decoded it as four replacement characters (65533 in decimal), indicating a decoding failure. However, when the same character was embedded within parentheses (e.g., SELECT regexp_bytecode('(💩)')), it was decoded correctly. This inconsistency pointed to a buffer boundary check flaw in the UTF-8 decoder.

UTF-8 Encoding Error for 3-Byte Characters in Range U+0800–U+0FFF

The regexp_bytecode function also mishandled the encoding of Unicode characters in the range U+0800 to U+0FFF. These characters require a 3-byte UTF-8 encoding, but SQLite generated malformed 2-byte sequences. For example, the character U+0800 (ࠀ, "Samaritan Letter Alaf") should be encoded as E0 A0 80 but was incorrectly written as E0 80. Similarly, U+0FFF (the last code point of the Tibetan block) should be E0 BF BF but was encoded as FF BF. These invalid sequences caused valid regex patterns containing these characters to fail or produce unexpected matches.

Possible Causes: Boundary Checks and Encoding Logic Flaws

Off-by-One Error in 4-Byte UTF-8 Decoding

The re_next_char function’s logic for decoding 4-byte UTF-8 sequences contained an off-by-one error in its buffer boundary check. By the time the check runs, the lead byte has already been consumed and p->i (the current read position) points at the first continuation byte, so the three continuation bytes sit at p->i, p->i + 1, and p->i + 2. The correct condition is therefore p->i + 2 < p->mx (where p->mx is the input length). The original code instead required p->i + 3 < p->mx, which demands one extra byte after the sequence. When the 4-byte sequence was at the very end of the input, p->i + 3 equaled p->mx, the check failed, and the lead byte was reported as a replacement character. Each of the three remaining continuation bytes was then decoded on its own as another U+FFFD, which accounts for the four replacement characters. When the character was followed by any other byte, such as the closing parenthesis in '(💩)', the overly strict check happened to pass and decoding succeeded.
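The effect of the two bounds checks can be reproduced outside SQLite. The following is a minimal sketch, not the code from regexp.c: DemoInput, demo_next_char, and demo_decode_all are hypothetical names, and the slack parameter is introduced only so that one function can model both the original check (3) and the corrected one (2).

```c
#include <stddef.h>

/* Hypothetical stand-in for the decoder's input cursor:
** z is the buffer, i the read position, mx the length in bytes. */
typedef struct {
  const unsigned char *z;
  size_t i;
  size_t mx;
} DemoInput;

/* Decode one code point. slack is the constant in the 4-byte bounds check:
** 3 models the original check (i+3 < mx), 2 models the corrected one. */
static unsigned demo_next_char(DemoInput *p, size_t slack){
  unsigned c;
  if( p->i>=p->mx ) return 0;
  c = p->z[p->i++];  /* lead byte consumed; i now indexes the next byte */
  if( c<0x80 ) return c;
  if( (c&0xe0)==0xc0 && p->i<p->mx && (p->z[p->i]&0xc0)==0x80 ){
    c = ((c&0x1f)<<6) | (p->z[p->i++]&0x3f);
    if( c<0x80 ) c = 0xfffd;
  }else if( (c&0xf0)==0xe0 && p->i+1<p->mx
         && (p->z[p->i]&0xc0)==0x80 && (p->z[p->i+1]&0xc0)==0x80 ){
    c = ((c&0x0f)<<12) | ((p->z[p->i]&0x3f)<<6) | (p->z[p->i+1]&0x3f);
    p->i += 2;
    if( c<=0x7ff || (c>=0xd800 && c<=0xdfff) ) c = 0xfffd;
  }else if( (c&0xf8)==0xf0 && p->i+slack<p->mx
         && (p->z[p->i]&0xc0)==0x80 && (p->z[p->i+1]&0xc0)==0x80
         && (p->z[p->i+2]&0xc0)==0x80 ){
    c = ((c&0x07)<<18) | ((p->z[p->i]&0x3f)<<12)
      | ((p->z[p->i+1]&0x3f)<<6) | (p->z[p->i+2]&0x3f);
    p->i += 3;
    if( c<=0xffff || c>0x10ffff ) c = 0xfffd;
  }else{
    c = 0xfffd;      /* malformed, truncated, or rejected by the bounds check */
  }
  return c;
}

/* Decode a whole buffer into out[]; returns the number of code points. */
static size_t demo_decode_all(const unsigned char *z, size_t n, size_t slack,
                              unsigned *out, size_t nOut){
  DemoInput in;
  size_t k = 0;
  in.z = z; in.i = 0; in.mx = n;
  while( in.i<in.mx && k<nOut ) out[k++] = demo_next_char(&in, slack);
  return k;
}
```

Calling demo_decode_all on the four bytes F0 9F 92 A9 with slack 3 yields four U+FFFD values, while slack 2 yields the single code point U+1F4A9. Wrapping the same bytes in parentheses decodes correctly with either value, because the closing parenthesis supplies the extra byte the stricter check demands.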

Incorrect Encoding Branch for 3-Byte UTF-8 Sequences

The re_compile function’s logic for writing UTF-8 prefixes (zInit) used an incorrect conditional check when encoding code points. The original code allowed code points up to 0xFFF (4095) to be encoded as 2-byte sequences. A 2-byte sequence carries only 11 payload bits (5 in the lead byte, 6 in the continuation byte), so it can represent code points no higher than 0x7FF (2047). Code points in the range U+0800–U+FFFF require 3-byte encoding. For values above 0x7FF, the extra bit spilled into the lead byte's marker bits, producing E0 80 for U+0800 instead of E0 A0 80. The root cause was an overly permissive threshold (x <= 0xFFF) in the encoding branch meant for 2-byte sequences.
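The arithmetic behind the bad bytes is easy to check by hand or with a small encoder. The sketch below is an illustration under assumptions, not the code from re_compile: demo_encode and its twoByteMax parameter are hypothetical, with 0xfff modelling the original threshold and 0x7ff the corrected one. For simplicity it returns 0 for anything above U+FFFF.

```c
#include <stddef.h>

/* Encode code point x into out[] and return the number of bytes written.
** twoByteMax is the upper bound of the 2-byte branch:
** 0xfff models the original threshold, 0x7ff the corrected one.
** This sketch returns 0 for code points above U+FFFF. */
static size_t demo_encode(unsigned x, unsigned twoByteMax, unsigned char *out){
  if( x<=0x7f ){
    out[0] = (unsigned char)x;
    return 1;
  }else if( x<=twoByteMax ){
    /* 110xxxxx 10xxxxxx: only 11 payload bits fit here */
    out[0] = (unsigned char)(0xc0 | (x>>6));
    out[1] = (unsigned char)(0x80 | (x&0x3f));
    return 2;
  }else if( x<=0xffff ){
    /* 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits */
    out[0] = (unsigned char)(0xe0 | (x>>12));
    out[1] = (unsigned char)(0x80 | ((x>>6)&0x3f));
    out[2] = (unsigned char)(0x80 | (x&0x3f));
    return 3;
  }
  return 0;
}
```

With twoByteMax set to 0xfff, U+0800 comes out as E0 80 and U+0FFF as FF BF, matching the incorrect output described earlier. With 0x7ff, the same inputs produce E0 A0 80 and E0 BF BF.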

Troubleshooting Steps, Solutions & Fixes

Step 1: Validate UTF-8 Decoding and Encoding Behavior

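To confirm the presence of these issues, test regexp_bytecode with known problem cases. Note that regexp_bytecode is a debugging aid and is only available when the extension is compiled with SQLITE_DEBUG: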

Test 4-Byte Decoding:

SELECT regexp_bytecode(char(0x1f4a9)); -- Expected: the bytecode listing matches on 128169 (U+1F4A9)
-- Incorrect output: four instances of 65533 (U+FFFD)

Test 3-Byte Encoding:

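-- SQLite string literals do not interpret \u escapes, so build the character with char()
SELECT regexp_bytecode(char(0x0800)); -- Expected: the compiled prefix shows bytes E0 A0 80
-- Incorrect output: E0 80 (invalid UTF-8)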

Step 2: Apply the Patch to regexp.c

The following modifications to regexp.c resolve both issues:

Fix 4-Byte Decoding Boundary Check

Modify the buffer check for 4-byte sequences from p->i + 3 < p->mx to p->i + 2 < p->mx, so that only the three continuation bytes, and not an additional trailing byte, must be present:

} else if( (c&0xf8)==0xf0 && p->i+2 < p->mx && (p->z[p->i]&0xc0)==0x80
     && (p->z[p->i+1]&0xc0)==0x80 && (p->z[p->i+2]&0xc0)==0x80 ){

Fix 3-Byte Encoding Threshold

Adjust the encoding conditional to restrict 2-byte sequences to code points ≤ 0x7FF (2047):

} else if( x<=0x7ff ){

Step 3: Rebuild SQLite and Verify Fixes

After patching regexp.c, recompile SQLite and rerun the tests:

Decoding Test Post-Fix:

SELECT regexp_bytecode(char(0x1f4a9)); -- Outputs 128169 (correct)

Encoding Test Post-Fix:

SELECT regexp_bytecode(char(0x0800)); -- Compiled prefix now shows bytes E0 A0 80 (correct)

Explanation of the Fixes

4-Byte Decoding: Because the lead byte has already been consumed, the adjusted boundary check (p->i + 2 < p->mx) requires exactly the three continuation bytes at p->i through p->i + 2 (four bytes in total, including the lead byte). The reads stay within the buffer, and a sequence that ends exactly at the end of the input is no longer rejected.
3-Byte Encoding: Lowering the threshold to 0x7FF ensures code points ≥ 0x0800 use the 3-byte encoding branch, producing valid UTF-8.

Long-Term Prevention

UTF-8 Validation Suites: Integrate comprehensive tests for edge cases (e.g., 4-byte sequences at string boundaries).
Code Reviews: Audit boundary checks and encoding thresholds in text-processing functions.
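One way to build such an edge-case suite is a round-trip test that encodes each boundary code point and decodes it from a buffer that ends exactly where the sequence ends. The sketch below is a standalone illustration, not part of SQLite's test suite: ref_encode, ref_decode, and check_boundaries are hypothetical helpers, and the decoder deliberately skips overlong-form and surrogate checks to stay short.

```c
#include <stddef.h>

/* Reference encoder: writes 1 to 4 bytes for cp (assumed <= 0x10FFFF)
** and returns the sequence length. */
static size_t ref_encode(unsigned cp, unsigned char *out){
  if( cp<=0x7f ){ out[0] = (unsigned char)cp; return 1; }
  if( cp<=0x7ff ){
    out[0] = (unsigned char)(0xc0 | (cp>>6));
    out[1] = (unsigned char)(0x80 | (cp&0x3f));
    return 2;
  }
  if( cp<=0xffff ){
    out[0] = (unsigned char)(0xe0 | (cp>>12));
    out[1] = (unsigned char)(0x80 | ((cp>>6)&0x3f));
    out[2] = (unsigned char)(0x80 | (cp&0x3f));
    return 3;
  }
  out[0] = (unsigned char)(0xf0 | (cp>>18));
  out[1] = (unsigned char)(0x80 | ((cp>>12)&0x3f));
  out[2] = (unsigned char)(0x80 | ((cp>>6)&0x3f));
  out[3] = (unsigned char)(0x80 | (cp&0x3f));
  return 4;
}

/* Reference decoder: decodes the sequence starting at z[0] within n bytes.
** Returns 0xfffd on truncated or malformed input and stores the number of
** bytes consumed in *pLen. Overlong forms and surrogates are NOT rejected. */
static unsigned ref_decode(const unsigned char *z, size_t n, size_t *pLen){
  unsigned c;
  size_t need, k;
  if( n==0 ){ *pLen = 0; return 0xfffd; }
  *pLen = 1;
  c = z[0];
  if( c<0x80 ) return c;
  if( (c&0xe0)==0xc0 ){ need = 1; c &= 0x1f; }
  else if( (c&0xf0)==0xe0 ){ need = 2; c &= 0x0f; }
  else if( (c&0xf8)==0xf0 ){ need = 3; c &= 0x07; }
  else return 0xfffd;
  if( need>=n ) return 0xfffd;  /* continuation bytes live at z[1..need] */
  for(k=1; k<=need; k++){
    if( (z[k]&0xc0)!=0x80 ) return 0xfffd;
    c = (c<<6) | (z[k]&0x3f);
  }
  *pLen = need+1;
  return c;
}

/* Round-trip each boundary code point with the sequence ending exactly at
** the end of the buffer. Returns the number of mismatches. */
static int check_boundaries(void){
  static const unsigned aCp[] = {
    0x7f, 0x80, 0x7ff, 0x800, 0xfff, 0xffff, 0x10000, 0x1f4a9, 0x10ffff
  };
  int nFail = 0;
  size_t i;
  for(i=0; i<sizeof(aCp)/sizeof(aCp[0]); i++){
    unsigned char buf[4];
    size_t nEnc = ref_encode(aCp[i], buf);
    size_t nDec = 0;
    unsigned got = ref_decode(buf, nEnc, &nDec);
    if( got!=aCp[i] || nDec!=nEnc ) nFail++;
  }
  return nFail;
}
```

The list of code points sits on each encoding-length boundary (0x7F/0x80, 0x7FF/0x800, 0xFFFF/0x10000) and includes the two values from this issue. Passing the exact encoded length as the buffer size is what exercises the end-of-string case; a decoder that demands a spare trailing byte would fail every multi-byte entry it mishandles.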

With these two flaws addressed, the patched regexp.c decodes 4-byte characters at the end of a string and encodes code points in the range U+0800–U+0FFF correctly.
