Resolving UTF-8 Decoding and Encoding Errors in SQLite regexp_bytecode

Issue Overview: Incorrect UTF-8 Character Handling in regexp_bytecode

The regexp_bytecode function in SQLite’s regexp.c extension exhibited two critical UTF-8 processing errors in specific edge cases. These errors impacted both decoding (reading UTF-8 sequences) and encoding (writing UTF-8 sequences) operations, leading to incorrect results when handling Unicode characters.

UTF-8 Decoding Error for 4-Byte Characters at String End

When decoding a Unicode code point in the range U+10000 to U+10FFFF (which requires a 4-byte UTF-8 encoding) located at the end of an input string, the re_next_char function in regexp.c generated replacement characters (U+FFFD) instead of the correct code point. For example, the Unicode character U+1F4A9 (💩, "Pile of Poo") is represented in UTF-8 as F0 9F 92 A9. When passed to regexp_bytecode as a standalone string, SQLite 3.40.0 decoded it as four replacement characters (65533 in decimal), indicating a decoding failure. However, when the same character was embedded within parentheses (e.g., SELECT regexp_bytecode('(💩)')), it was decoded correctly. This inconsistency pointed to a buffer boundary check flaw in the UTF-8 decoder.

UTF-8 Encoding Error for 3-Byte Characters in Range U+0800–U+0FFF

The regexp_bytecode function also mishandled the encoding of Unicode characters in the range U+0800 to U+0FFF. These characters require a 3-byte UTF-8 encoding, but SQLite generated truncated 2-byte sequences. For example, the character U+0800 (Samaritan Letter Alaf) should be encoded as E0 A0 80 but was incorrectly written as E0 80. Similarly, U+0FFF (the last code point of the Tibetan block) should be E0 BF BF but was encoded as FF BF. This truncation caused valid regex patterns containing these characters to fail or produce unexpected matches.

Possible Causes: Boundary Checks and Encoding Logic Flaws

Off-by-One Error in 4-Byte UTF-8 Decoding

The re_next_char function's bounds check for 4-byte UTF-8 sequences was off by one. The original code required p->i + 3 < p->mx (where p->i is the current read position and p->mx is the input length) before decoding a 4-byte character. By the time this check runs, however, the lead byte has already been consumed and p->i points at the first continuation byte, so the three continuation bytes sit at p->i, p->i + 1, and p->i + 2. The correct condition is therefore p->i + 2 < p->mx; the original check demanded one additional byte after the sequence. When the 4-byte sequence was at the very end of the input, p->i + 3 equaled p->mx, the check failed, and the decoder fell through to its replacement-character path: the lead byte produced U+FFFD, and each of the three continuation bytes, now read as stray bytes, produced another. Any trailing byte, such as the closing parenthesis in '(💩)', satisfied the stricter check, which is why the embedded case decoded correctly.
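The effect of the bounds check can be reproduced in isolation. The following is a minimal sketch, not the code from regexp.c: the Input struct, next_char, first_char, and the slack parameter are hypothetical names introduced for illustration, and only the 4-byte branch is modeled (every other multi-byte lead byte is simply reported as U+FFFD).

```c
#include <stddef.h>

/* Hypothetical stand-in for the input cursor described in the text. */
typedef struct {
  const unsigned char *z;  /* input bytes */
  size_t i;                /* index of the next byte to read */
  size_t mx;               /* number of bytes in z */
} Input;

/* Decode one code point. `slack` is the offset used in the 4-byte bounds
** check: 3 mimics the original check, 2 the corrected one. Only the 4-byte
** branch is modeled; any other non-ASCII byte yields U+FFFD. */
static unsigned next_char(Input *p, size_t slack){
  unsigned c;
  if( p->i >= p->mx ) return 0;
  c = p->z[p->i++];            /* lead byte consumed: i now points past it */
  if( c < 0x80 ) return c;
  if( (c & 0xf8) == 0xf0 && p->i + slack < p->mx
   && (p->z[p->i]   & 0xc0) == 0x80
   && (p->z[p->i+1] & 0xc0) == 0x80
   && (p->z[p->i+2] & 0xc0) == 0x80 ){
    c = ((c & 0x07) << 18)
      | ((unsigned)(p->z[p->i]   & 0x3f) << 12)
      | ((unsigned)(p->z[p->i+1] & 0x3f) << 6)
      |  (unsigned)(p->z[p->i+2] & 0x3f);
    p->i += 3;
    if( c <= 0xffff || c > 0x10ffff ) c = 0xfffd;  /* overlong or out of range */
    return c;
  }
  return 0xfffd;
}

/* Convenience wrapper: decode the first code point of an n-byte buffer. */
static unsigned first_char(const unsigned char *z, size_t n, size_t slack){
  Input in = { z, 0, n };
  return next_char(&in, slack);
}
```

Calling first_char on the four bytes F0 9F 92 A9 with slack 3 yields U+FFFD, while slack 2 yields U+1F4A9. Appending any fifth byte makes the slack 3 variant succeed as well, which matches the parenthesized case described above.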

Incorrect Encoding Branch for 3-Byte UTF-8 Sequences

The re_compile function's logic for writing UTF-8 prefixes (zInit) used an incorrect threshold when choosing the encoding length. The original code sent code points up to 0xFFF (4095) down the 2-byte branch. A 2-byte sequence carries only 11 payload bits, so it can represent code points up to U+07FF; code points in the range U+0800–U+FFFF require 3 bytes. For U+0800, the value shifted into the lead byte (0x800 >> 6 = 0x20) overflows the 5-bit field and turns the 0xC0 marker into 0xE0, so the output was E0 80 instead of E0 A0 80: a 3-byte lead byte followed by a single continuation byte. The root cause was the overly permissive threshold (x <= 0xFFF) on the 2-byte branch.
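The byte values quoted above can be checked with a short sketch of the branching logic. This is an illustration rather than the code from regexp.c: encode_prefix_char and its two_byte_max parameter are hypothetical names, and the sketch assumes that only code points up to U+FFFF need to be handled.

```c
#include <stddef.h>

/* Encode code point x into out[] and return the number of bytes written.
** `two_byte_max` is the largest value sent down the 2-byte branch:
** 0xfff mimics the original threshold, 0x7ff the corrected one.
** Assumption for this sketch: code points above U+FFFF are not encoded
** and the function returns 0 for them. */
static size_t encode_prefix_char(unsigned x, unsigned two_byte_max,
                                 unsigned char out[3]){
  if( x <= 0x7f ){
    out[0] = (unsigned char)x;
    return 1;
  }else if( x <= two_byte_max ){
    /* 110xxxxx 10xxxxxx: only valid when x fits in 11 bits */
    out[0] = (unsigned char)(0xc0 | (x >> 6));
    out[1] = (unsigned char)(0x80 | (x & 0x3f));
    return 2;
  }else if( x <= 0xffff ){
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xe0 | (x >> 12));
    out[1] = (unsigned char)(0x80 | ((x >> 6) & 0x3f));
    out[2] = (unsigned char)(0x80 | (x & 0x3f));
    return 3;
  }
  return 0;
}
```

With two_byte_max set to 0xfff, U+0800 comes out as E0 80 and U+0FFF as FF BF. With 0x7ff, the same inputs produce E0 A0 80 and E0 BF BF.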

Troubleshooting Steps, Solutions & Fixes

Step 1: Validate UTF-8 Decoding and Encoding Behavior

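To confirm the presence of these issues, test regexp_bytecode with known problem cases. Note that regexp_bytecode is a debugging aid: in the stock regexp.c it is only registered when the extension is compiled with SQLITE_DEBUG, so a "no such function" error usually means a non-debug build.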

Test 4-Byte Decoding:

SELECT regexp_bytecode(char(0x1f4a9)); -- Expected: the listing shows code point 128169 (U+1F4A9)
-- Incorrect output: four instances of 65533 (U+FFFD)

Test 3-Byte Encoding:

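-- \u0800 is a regex escape interpreted by regexp.c; SQL string literals have no \u escapes
SELECT regexp_bytecode('\u0800'); -- Expected: the listing shows the prefix bytes E0 A0 80
-- Incorrect output: E0 80 (invalid UTF-8)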

Step 2: Apply the Patch to regexp.c

The following modifications to regexp.c resolve both issues:

Fix 4-Byte Decoding Boundary Check

Change the bounds check for 4-byte sequences from p->i + 3 < p->mx to p->i + 2 < p->mx. Because p->i already points at the first continuation byte, this is exactly the condition under which all three continuation bytes are in bounds:

} else if( (c&0xf8)==0xf0 && p->i+2 < p->mx && (p->z[p->i]&0xc0)==0x80
     && (p->z[p->i+1]&0xc0)==0x80 && (p->z[p->i+2]&0xc0)==0x80 ){

Fix 3-Byte Encoding Threshold

Adjust the encoding conditional to restrict 2-byte sequences to code points ≤ 0x7FF (2047):

} else if( x<=0x7ff ){

Step 3: Rebuild SQLite and Verify Fixes

After patching regexp.c, recompile SQLite and rerun the tests:

Decoding Test Post-Fix:

SELECT regexp_bytecode(char(0x1f4a9)); -- Outputs 128169 (correct)

Encoding Test Post-Fix:

SELECT regexp_bytecode('\u0800'); -- Listing now shows the prefix bytes E0 A0 80 (correct)

Explanation of the Fixes

  1. 4-Byte Decoding: The adjusted boundary check (p->i + 2 < p->mx) confirms that the three continuation bytes following the already-consumed lead byte are in bounds (four bytes in total). It still guards against reading past the end of the input, but no longer demands a spare byte after the sequence, so a 4-byte character at the very end of the string is accepted.
  2. 3-Byte Encoding: Lowering the threshold to 0x7FF ensures code points ≥ 0x0800 use the 3-byte encoding branch, producing valid UTF-8.

Long-Term Prevention

  • UTF-8 Validation Suites: Integrate comprehensive tests for edge cases (e.g., 4-byte sequences at string boundaries).
  • Code Reviews: Audit boundary checks and encoding thresholds in text-processing functions.
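A boundary-focused test of the kind suggested above might look like the following sketch. It is self-contained and does not call into SQLite: utf8_encode, utf8_decode, and count_boundary_failures are hypothetical helpers written for this example. Each code point is encoded into a buffer that ends exactly where the sequence ends, which is the situation that triggered the decoding bug.

```c
#include <stddef.h>

/* Standard UTF-8 encoder for U+0000..U+10FFFF; returns bytes written. */
static size_t utf8_encode(unsigned x, unsigned char out[4]){
  if( x <= 0x7f ){
    out[0] = (unsigned char)x;
    return 1;
  }
  if( x <= 0x7ff ){
    out[0] = (unsigned char)(0xc0 | (x >> 6));
    out[1] = (unsigned char)(0x80 | (x & 0x3f));
    return 2;
  }
  if( x <= 0xffff ){
    out[0] = (unsigned char)(0xe0 | (x >> 12));
    out[1] = (unsigned char)(0x80 | ((x >> 6) & 0x3f));
    out[2] = (unsigned char)(0x80 | (x & 0x3f));
    return 3;
  }
  out[0] = (unsigned char)(0xf0 | (x >> 18));
  out[1] = (unsigned char)(0x80 | ((x >> 12) & 0x3f));
  out[2] = (unsigned char)(0x80 | ((x >> 6) & 0x3f));
  out[3] = (unsigned char)(0x80 | (x & 0x3f));
  return 4;
}

/* Decode the code point that starts at z[0] in an n-byte buffer.
** Returns 0xfffd for truncated, malformed, or overlong input. */
static unsigned utf8_decode(const unsigned char *z, size_t n){
  unsigned c;
  if( n == 0 ) return 0xfffd;
  c = z[0];
  if( c < 0x80 ) return c;
  if( (c & 0xe0) == 0xc0 && n >= 2 && (z[1] & 0xc0) == 0x80 ){
    c = ((c & 0x1f) << 6) | (unsigned)(z[1] & 0x3f);
    return c < 0x80 ? 0xfffd : c;
  }
  if( (c & 0xf0) == 0xe0 && n >= 3
   && (z[1] & 0xc0) == 0x80 && (z[2] & 0xc0) == 0x80 ){
    c = ((c & 0x0f) << 12) | ((unsigned)(z[1] & 0x3f) << 6)
      | (unsigned)(z[2] & 0x3f);
    return (c < 0x800 || (c >= 0xd800 && c <= 0xdfff)) ? 0xfffd : c;
  }
  if( (c & 0xf8) == 0xf0 && n >= 4 && (z[1] & 0xc0) == 0x80
   && (z[2] & 0xc0) == 0x80 && (z[3] & 0xc0) == 0x80 ){
    c = ((c & 0x07) << 18) | ((unsigned)(z[1] & 0x3f) << 12)
      | ((unsigned)(z[2] & 0x3f) << 6) | (unsigned)(z[3] & 0x3f);
    return (c < 0x10000 || c > 0x10ffff) ? 0xfffd : c;
  }
  return 0xfffd;
}

/* Round-trip code points on either side of each encoding-length boundary,
** with the sequence ending exactly at the end of the buffer.
** Returns the number of mismatches (0 means every case passed). */
static int count_boundary_failures(void){
  static const unsigned cases[] = {
    0x7f, 0x80, 0x7ff, 0x800, 0xfff, 0xffff, 0x10000, 0x1f4a9, 0x10ffff
  };
  int fails = 0;
  size_t k;
  for(k = 0; k < sizeof(cases)/sizeof(cases[0]); k++){
    unsigned char buf[4];
    size_t n = utf8_encode(cases[k], buf);
    if( utf8_decode(buf, n) != cases[k] ) fails++;
  }
  return fails;
}
```

The case list deliberately includes U+0800, U+0FFF, and U+1F4A9 from this article. In a real suite, the same table could drive calls into the code under test instead of the stand-in helpers used here.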

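With both fixes applied, regexp.c decodes 4-byte sequences that end exactly at the end of the input and encodes code points in the range U+0800–U+0FFF as valid 3-byte sequences, so regexp_bytecode reports the expected code points and prefix bytes for the test cases above.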
