Resolving UTF-8 Decoding and Encoding Errors in SQLite regexp_bytecode
Issue Overview: Incorrect UTF-8 Character Handling in regexp_bytecode
The regexp_bytecode function in SQLite’s regexp.c extension exhibited two critical UTF-8 processing errors in specific edge cases. These errors impacted both decoding (reading UTF-8 sequences) and encoding (writing UTF-8 sequences) operations, leading to incorrect results when handling Unicode characters.
UTF-8 Decoding Error for 4-Byte Characters at String End
When decoding a Unicode code point in the range U+10000 to U+10FFFF (which requires a 4-byte UTF-8 encoding) located at the end of an input string, the re_next_char function in regexp.c generated replacement characters (U+FFFD) instead of the correct code point. For example, the Unicode character U+1F4A9 (💩, "Pile of Poo") is represented in UTF-8 as F0 9F 92 A9. When passed to regexp_bytecode as a standalone string, SQLite 3.40.0 decoded it as four replacement characters (65533 in decimal), indicating a decoding failure. However, when the same character was embedded within parentheses (e.g., SELECT regexp_bytecode('(💩)')), it was decoded correctly. This inconsistency pointed to a buffer boundary check flaw in the UTF-8 decoder.
UTF-8 Encoding Error for 3-Byte Characters in Range U+0800–U+0FFF
The regexp_bytecode function also mishandled the encoding of Unicode characters in the range U+0800 to U+0FFF. These characters require a 3-byte UTF-8 encoding, but SQLite generated truncated 2-byte sequences. For example, the character U+0800 ("Samaritan Letter Alaf") should be encoded as E0 A0 80 but was incorrectly written as E0 80. Similarly, U+0FFF (the last code point of the Tibetan block) should be E0 BF BF but was encoded as FF BF. This truncation caused valid regex patterns containing these characters to fail or produce unexpected matches.
Possible Causes: Boundary Checks and Encoding Logic Flaws
Off-by-One Error in 4-Byte UTF-8 Decoding
The re_next_char function’s logic for decoding 4-byte UTF-8 sequences contained an off-by-one error in its buffer boundary check. The original code verified that p->i + 3 < p->mx (where p->i is the current read position and p->mx is the input length) before processing a 4-byte character. By the time this check runs, the lead byte has already been consumed, so p->i points at the first continuation byte and the three continuation bytes occupy indices p->i through p->i + 2. The check therefore demanded one byte more than the sequence needs. When the 4-byte sequence sat at the very end of the input, p->i + 3 equaled p->mx, the check failed, and the decoder fell through to the replacement character: once for the lead byte, then once for each of the three continuation bytes, which were subsequently read as stray bytes. Any trailing byte, such as the closing parenthesis in '(💩)', was enough to satisfy the check, which is why the embedded form decoded correctly.
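The effect of the bounds check can be reproduced in isolation. The following is a minimal sketch, not the code from regexp.c: the Input struct, the next_char function, and its slack parameter are hypothetical names chosen for illustration, and only the 1-byte and 4-byte cases are modelled. It assumes, as the indexing in the corrected condition suggests, that the lead byte has already been consumed when the check runs.

```c
#include <stddef.h>

/* Hypothetical stand-in for the decoder's input state. The field names
** mirror the p->z, p->i and p->mx used in the text. */
typedef struct {
  const unsigned char *z;  /* input bytes */
  size_t i;                /* index of the next byte to read */
  size_t mx;               /* total number of input bytes */
} Input;

/* Decode one code point. `slack` is the offset used in the bounds check:
** 3 models the overly strict check, 2 models the corrected one. */
static unsigned next_char(Input *p, size_t slack){
  unsigned c;
  if( p->i>=p->mx ) return 0;
  c = p->z[p->i++];  /* lead byte consumed; p->i is now the 1st continuation */
  if( c<0x80 ) return c;
  if( (c&0xf8)==0xf0 && p->i+slack<p->mx
   && (p->z[p->i]&0xc0)==0x80
   && (p->z[p->i+1]&0xc0)==0x80
   && (p->z[p->i+2]&0xc0)==0x80 ){
    c = (c&0x07)<<18 | (unsigned)(p->z[p->i]&0x3f)<<12
      | (unsigned)(p->z[p->i+1]&0x3f)<<6 | (unsigned)(p->z[p->i+2]&0x3f);
    p->i += 3;
    return c;
  }
  return 0xfffd;     /* everything else becomes the replacement character */
}
```

With the four bytes F0 9F 92 A9 as the whole input, slack 3 evaluates 1 + 3 < 4, which is false, so the lead byte and then each continuation byte yields 0xfffd. With slack 2 the same input decodes to 0x1f4a9.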
Incorrect Encoding Branch for 3-Byte UTF-8 Sequences
The re_compile function’s logic for writing UTF-8 prefixes (zInit) used an incorrect conditional check when encoding code points. The original code allowed code points up to 0xFFF (4095) to be encoded as 2-byte sequences. However, Unicode code points in the range U+0800–U+FFFF require 3-byte encoding. This mismatch caused characters like U+0800 to be truncated to 2 bytes (E0 80 instead of E0 A0 80). The root cause was an overly permissive threshold (x <= 0xFFF) in the encoding branch meant for 2-byte sequences.
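The arithmetic behind the truncated output can be checked with a small stand-alone encoder. This is an illustrative sketch only: encode_utf8 and its two_byte_max parameter are hypothetical and are not taken from regexp.c. Passing 0xfff models the faulty threshold, and 0x7ff models the correct one.

```c
#include <stddef.h>

/* Write code point x to out[] as UTF-8 and return the byte count.
** `two_byte_max` is the upper bound of the 2-byte branch.
** Hypothetical helper for illustration; 4-byte code points are not handled. */
static size_t encode_utf8(unsigned x, unsigned two_byte_max, unsigned char *out){
  if( x<=0x7f ){
    out[0] = (unsigned char)x;
    return 1;
  }else if( x<=two_byte_max ){
    /* 2-byte form: 110xxxxx 10xxxxxx (only 11 payload bits fit) */
    out[0] = (unsigned char)(0xc0 | (x>>6));
    out[1] = (unsigned char)(0x80 | (x&0x3f));
    return 2;
  }else if( x<=0xffff ){
    /* 3-byte form: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xe0 | (x>>12));
    out[1] = (unsigned char)(0x80 | ((x>>6)&0x3f));
    out[2] = (unsigned char)(0x80 | (x&0x3f));
    return 3;
  }
  return 0;  /* outside the range covered by this sketch */
}
```

For x = 0x800 and a threshold of 0xfff, the 2-byte branch computes 0xc0 | 0x20 = 0xe0 and 0x80 | 0x00 = 0x80, which is the E0 80 described above. For x = 0xfff it computes 0xc0 | 0x3f = 0xff and 0x80 | 0x3f = 0xbf. With the threshold at 0x7ff both code points reach the 3-byte branch and come out as E0 A0 80 and E0 BF BF.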
Troubleshooting Steps, Solutions & Fixes
Step 1: Validate UTF-8 Decoding and Encoding Behavior
To confirm the presence of these issues, test regexp_bytecode with known problem cases:
Test 4-Byte Decoding:
SELECT regexp_bytecode(char(0x1f4a9)); -- Should return 128169 (U+1F4A9)
-- Incorrect output: Four instances of 65533 (U+FFFD)
Test 3-Byte Encoding:
SELECT regexp_bytecode('\u0800'); -- Should return UTF-8 bytes E0 A0 80
-- Incorrect output: E0 80 (invalid UTF-8)
Step 2: Apply the Patch to regexp.c
The following modifications to regexp.c resolve both issues:
Fix 4-Byte Decoding Boundary Check
Modify the buffer check for 4-byte sequences from p->i + 3 < p->mx to p->i + 2 < p->mx, so that only the three continuation bytes at p->i through p->i + 2 must lie within the input:
} else if( (c&0xf8)==0xf0 && p->i+2 < p->mx && (p->z[p->i]&0xc0)==0x80
&& (p->z[p->i+1]&0xc0)==0x80 && (p->z[p->i+2]&0xc0)==0x80 ){
Fix 3-Byte Encoding Threshold
Adjust the encoding conditional to restrict 2-byte sequences to code points ≤ 0x7FF (2047):
} else if( x<=0x7ff ){
Step 3: Rebuild SQLite and Verify Fixes
After patching regexp.c, recompile SQLite and rerun the tests:
Decoding Test Post-Fix:
SELECT regexp_bytecode(char(0x1f4a9)); -- Outputs 128169 (correct)
Encoding Test Post-Fix:
SELECT regexp_bytecode('\u0800'); -- Prefix bytes are E0 A0 80 (correct)
Explanation of the Fixes
- 4-Byte Decoding: The adjusted boundary check (p->i + 2 < p->mx) requires exactly three continuation bytes after the already-consumed lead byte (four bytes in total). A valid sequence at the very end of the input is no longer rejected, and the reads at p->i through p->i + 2 still stay within bounds.
- 3-Byte Encoding: Lowering the threshold to 0x7FF ensures code points ≥ 0x0800 use the 3-byte encoding branch, producing valid UTF-8.
Long-Term Prevention
- UTF-8 Validation Suites: Integrate comprehensive tests for edge cases (e.g., 4-byte sequences at string boundaries).
- Code Reviews: Audit boundary checks and encoding thresholds in text-processing functions.
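A boundary-focused test of this kind might look like the sketch below. It is a stand-alone illustration under stated assumptions: enc, dec, and run_boundary_checks are hypothetical reference helpers rather than SQLite functions, and the decoder does not reject overlong forms or surrogates. Each code point on either side of an encoding-length boundary is encoded, placed so that it ends exactly at the end of the buffer, and decoded again.

```c
#include <stddef.h>

/* Reference encoder: writes x (up to 0x10ffff) as UTF-8, returns 1 to 4. */
static size_t enc(unsigned x, unsigned char *o){
  if( x<=0x7f ){ o[0] = (unsigned char)x; return 1; }
  if( x<=0x7ff ){
    o[0] = (unsigned char)(0xc0 | (x>>6));
    o[1] = (unsigned char)(0x80 | (x&0x3f));
    return 2;
  }
  if( x<=0xffff ){
    o[0] = (unsigned char)(0xe0 | (x>>12));
    o[1] = (unsigned char)(0x80 | ((x>>6)&0x3f));
    o[2] = (unsigned char)(0x80 | (x&0x3f));
    return 3;
  }
  o[0] = (unsigned char)(0xf0 | (x>>18));
  o[1] = (unsigned char)(0x80 | ((x>>12)&0x3f));
  o[2] = (unsigned char)(0x80 | ((x>>6)&0x3f));
  o[3] = (unsigned char)(0x80 | (x&0x3f));
  return 4;
}

/* Reference decoder: reads one code point from z[0..n), stores it in *out
** and returns the bytes consumed, or 0 on malformed or truncated input. */
static size_t dec(const unsigned char *z, size_t n, unsigned *out){
  size_t need, k;
  unsigned c;
  if( n==0 ) return 0;
  c = z[0];
  if( c<0x80 ){ *out = c; return 1; }
  if( (c&0xe0)==0xc0 ){ need = 1; c &= 0x1f; }
  else if( (c&0xf0)==0xe0 ){ need = 2; c &= 0x0f; }
  else if( (c&0xf8)==0xf0 ){ need = 3; c &= 0x07; }
  else return 0;
  if( need>=n ) return 0;  /* lead byte plus `need` continuations must fit */
  for(k=1; k<=need; k++){
    if( (z[k]&0xc0)!=0x80 ) return 0;
    c = (c<<6) | (z[k]&0x3f);
  }
  *out = c;
  return need+1;
}

/* Returns how many boundary code points fail to round-trip when the
** encoded sequence ends exactly at the end of the buffer. */
static int run_boundary_checks(void){
  static const unsigned cases[] = {
    0x7f, 0x80, 0x7ff, 0x800, 0xfff, 0x1000,
    0xffff, 0x10000, 0x1f4a9, 0x10ffff
  };
  int failures = 0;
  size_t i;
  for(i=0; i<sizeof(cases)/sizeof(cases[0]); i++){
    unsigned char buf[4];
    unsigned got = 0;
    size_t n = enc(cases[i], buf);
    if( dec(buf, n, &got)!=n || got!=cases[i] ) failures++;
  }
  return failures;
}
```

A real suite would run the same cases through the functions under test rather than through reference helpers, and would add malformed inputs such as truncated sequences and stray continuation bytes.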
With these two flaws in UTF-8 handling fixed, SQLite’s regexp_bytecode function correctly processes Unicode characters across all valid ranges.