Handling Non-Null-Terminated Strings in fts5TriTokenize: Buffer Overread Risks and Fixes
Understanding the fts5TriTokenize Buffer Overread Vulnerability
The Core Problem: UTF-8 Parsing and Input Boundary Checks
The `fts5TriTokenize` function, part of SQLite's Full-Text Search (FTS5) module, generates trigram tokens from input text for indexing and querying. A critical issue arises when this function processes input strings that are not null-terminated, particularly those with `nText=0` (empty strings) or strings ending in the middle of a UTF-8 sequence. The root cause lies in how the `READ_UTF8` macro interacts with the input buffer's boundaries.
When `fts5TriTokenize` processes a string, it relies on `READ_UTF8` to decode UTF-8 characters incrementally. The macro reads bytes until it reaches the end of the buffer (`zTerm`) or a valid UTF-8 boundary. However, prior to fixes in SQLite's source code, `READ_UTF8` used a `zIn != zTerm` comparison to decide whether to read the next byte. This comparison is insufficient because it does not account for cases where `zIn` (the current read position) has already reached or exceeded `zTerm` (the end of the buffer). In such cases, the macro could read one or more bytes beyond the allocated input buffer, leading to undefined behavior, including crashes, data leaks, or exposure of adjacent memory content.
For example, consider an empty input string (`nText=0`, with `pText` pointing to arbitrary memory). The tokenizer would invoke `READ_UTF8` unconditionally, attempting to read the first byte of `pText` even though `nText=0` means the buffer contains no valid data. If the input buffer is not null-terminated, this read accesses memory outside the intended bounds. Similarly, for non-empty strings ending in the middle of a multi-byte UTF-8 sequence, the tokenizer could read past `zTerm` while attempting to decode the incomplete character.
This issue violates the `xTokenize` API contract, which explicitly allows non-null-terminated input strings. The problem is not hypothetical: applications invoking `xTokenize` directly (e.g., for testing custom tokenizers or preprocessing text) could trigger this behavior. Even if SQLite's internal use of `fts5TriTokenize` does not currently pass non-null-terminated strings, third-party code adhering to the API's specification might inadvertently expose the vulnerability.
Root Causes of the Buffer Overread
1. Insufficient Boundary Checks in `READ_UTF8`
The `READ_UTF8` macro, as originally implemented, begins by unconditionally reading the first byte of the input (`c = *(zIn++)`), regardless of whether `zIn` has already reached `zTerm`. This design assumes that the caller has already ensured `zIn < zTerm` before invoking the macro. However, `fts5TriTokenize` did not enforce this precondition, leading to scenarios where `zIn == zTerm` at the start of a read operation. For example:
- Empty input (`nText=0`): `zTerm` is set to `pText + nText`, which equals `pText` when `nText=0`. If `fts5TriTokenize` calls `READ_UTF8` without checking `zIn < zTerm`, the macro dereferences `pText` (which may point to invalid or unmapped memory).
- Mid-UTF-8 termination: if the input ends in the middle of a multi-byte UTF-8 character (e.g., `nText=1` with `pText` containing `0xC2`, the start of a 2-byte sequence), `READ_UTF8` will attempt to read the next byte even though `zIn` has already reached `zTerm`.
The original `READ_UTF8` check of `zIn != zTerm` during multi-byte decoding is also flawed. Using `!=` instead of `<` allows `zIn` to advance beyond `zTerm` if the loop starts with `zIn == zTerm`: once the unconditional first read pushes `zIn` past `zTerm`, the `!=` comparison never becomes false at the boundary, so subsequent iterations fail to detect the overrun.
2. Lack of Prevalidation in `fts5TriTokenize`
The `fts5TriTokenize` function's main loop does not verify that the input buffer still has data before invoking `READ_UTF8`. This oversight is most apparent when processing empty strings: the loop proceeds to decode characters even when `nText=0`. The tokenizer should check `zIn < zTerm` before attempting to read any bytes; without this guard clause, the macro dereferences `zIn` even when the buffer is empty or fully consumed.
3. Reliance on Null-Terminators for Safety
While the `xTokenize` API permits non-null-terminated input, `fts5TriTokenize` implicitly relied on encountering a null byte (`\0`) to halt parsing. This creates a contradiction: the function is expected to treat `nText` as the authoritative buffer length but may read beyond `nText` bytes if no null terminator exists. This behavior violates the principle of buffer safety and makes correctness depend on external memory layout.
Resolving the Overread: Code Fixes and Best Practices
1. Correcting `READ_UTF8`'s Boundary Checks
The first fix modifies the `READ_UTF8` macro to use `<` instead of `!=` when comparing `zIn` against `zTerm`:
```c
#define READ_UTF8(zIn, zTerm, c)                        \
  c = *(zIn++);                                         \
  if( c >= 0xc0 ){                                      \
    c = sqlite3Utf8Trans1[c - 0xc0];                    \
    while( zIn < zTerm && (*zIn & 0xc0) == 0x80 ){      \
      c = (c << 6) + (0x3f & *(zIn++));                 \
    }                                                   \
    /* ... error handling ... */                        \
  }
```
This change ensures that the decoding loop terminates once `zIn` reaches or exceeds `zTerm`, preventing overreads during multi-byte decoding. However, it does not by itself address the initial unconditional read of `*(zIn++)`.
2. Adding Prevalidation in `fts5TriTokenize`
To prevent the initial overread, `fts5TriTokenize` must check `zIn < zTerm` before invoking `READ_UTF8`. The corrected loop structure looks like this:
```c
while( zIn < zTerm ){
  READ_UTF8(zIn, zTerm, c);
  /* ... tokenization logic ... */
}
```
For empty input (`zIn == zTerm`), the loop body is never entered and `READ_UTF8` is never called. This ensures that no bytes are read when `nText=0`.
3. Comprehensive Test Cases
To validate the fixes, developers should test edge cases such as:

- Empty strings: `nText=0` with `pText` pointing to non-null memory.
- Incomplete UTF-8 sequences: strings ending with a partial multi-byte character (e.g., `0xC2` followed by no continuation bytes).
- No null terminator: buffers of exact length (e.g., `nText=3`, `pText="abc"` with no trailing `\0`).

These tests ensure that the tokenizer respects `nText` and does not access memory beyond `pText + nText`.
4. Implications for Custom Tokenizers
Developers implementing custom tokenizers should adopt similar safeguards:

- Always validate `zIn < zTerm` before reading bytes.
- Avoid assumptions about null terminators.
- Use SQLite's `sqlite3Fts5UnicodeIsAlnum` or similar functions for standardized UTF-8 handling.
By addressing the boundary checks in both `READ_UTF8` and `fts5TriTokenize`, SQLite ensures compliance with the `xTokenize` API contract and eliminates the risks associated with non-null-terminated inputs. These fixes underscore the importance of rigorous buffer management in low-level text processing functions, particularly when handling variable-width encodings like UTF-8.