Handling Non-Null-Terminated Strings in fts5TriTokenize: Buffer Overread Risks and Fixes


Understanding the fts5TriTokenize Buffer Overread Vulnerability

The Core Problem: UTF-8 Parsing and Input Boundary Checks

The fts5TriTokenize function, part of SQLite’s Full-Text Search (FTS5) module, is designed to generate trigram tokens from input text for indexing and querying. A critical issue arises when this function processes input strings that are not null-terminated, particularly those with nText=0 (empty strings) or strings ending in the middle of a UTF-8 sequence. The root cause lies in how the READ_UTF8 macro interacts with the input buffer’s boundaries.

When fts5TriTokenize processes a string, it relies on READ_UTF8 to decode UTF-8 characters incrementally. The macro is structured to read bytes until it reaches the end of the buffer (zTerm) or completes a UTF-8 character. However, prior to fixes in SQLite’s source code, READ_UTF8 used a zIn != zTerm comparison to decide whether to read the next byte. This comparison cannot detect when zIn (the current read position) has already advanced beyond zTerm (the end of the buffer): the macro reads its first byte unconditionally, so if it is entered with zIn == zTerm, zIn immediately moves past zTerm and the != test never becomes false again. In that state the macro can read one or more bytes beyond the allocated input buffer, leading to undefined behavior, including crashes or exposure of adjacent memory contents.
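For reference, the pre-fix macro had roughly the following shape (a reconstruction for illustration; the error handling that follows the decode loop is elided). Note the unconditional first read and the != comparison guarding the loop:

#define READ_UTF8(zIn, zTerm, c)               \
  c = *(zIn++);                                \
  if( c >= 0xc0 ){                             \
    c = sqlite3Utf8Trans1[c - 0xc0];           \
    while( zIn != zTerm && (*zIn & 0xc0) == 0x80 ){ \
      c = (c << 6) + (0x3f & *(zIn++));        \
    }                                          \
    /* ... error handling ... */               \
  }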

For example, consider an empty input string (nText=0, pText pointing to arbitrary memory). The tokenizer would invoke READ_UTF8 unconditionally, attempting to read the first byte of pText even though nText=0 implies the buffer has no valid data. If the input buffer is not null-terminated, this read operation accesses memory outside the intended bounds. Similarly, for non-empty strings ending in the middle of a multi-byte UTF-8 sequence, the tokenizer could read past zTerm while attempting to decode the incomplete character.

This issue violates the xTokenize API contract, which explicitly allows non-null-terminated input strings. The problem is not hypothetical: applications invoking xTokenize directly (e.g., for testing custom tokenizers or preprocessing text) could trigger this behavior. Even if SQLite’s internal use of fts5TriTokenize does not currently pass non-null-terminated strings, third-party code adhering to the API’s specifications might inadvertently expose the vulnerability.
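For reference, here is the xTokenize method as declared in the fts5_tokenizer struct in fts5.h. The contract hands the implementation pText and nText only; nothing guarantees a terminator at pText[nText]:

int (*xTokenize)(Fts5Tokenizer*,
    void *pCtx,
    int flags,            /* Mask of FTS5_TOKENIZE_* flags */
    const char *pText, int nText,
    int (*xToken)(
      void *pCtx,         /* Copy of 2nd argument to xTokenize() */
      int tflags,         /* Mask of FTS5_TOKEN_* flags */
      const char *pToken, /* Pointer to buffer containing token */
      int nToken,         /* Size of token in bytes */
      int iStart,         /* Byte offset of token within input text */
      int iEnd            /* Byte offset of end of token within input text */
    )
);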


Root Causes of the Buffer Overread

1. Insufficient Boundary Checks in READ_UTF8

The READ_UTF8 macro, as originally implemented, begins by unconditionally reading the first byte of the input (c = *(zIn++)), regardless of whether zIn has already reached zTerm. This design assumes that the caller has already ensured zIn < zTerm before invoking the macro. However, fts5TriTokenize did not enforce this precondition, leading to scenarios where zIn == zTerm at the start of a read operation. For example:

  • Empty Input (nText=0): zTerm is set to pText + nText, which equals pText when nText=0. If fts5TriTokenize proceeds to call READ_UTF8 without checking zIn < zTerm, the macro dereferences pText (which may point to invalid or unmapped memory).

  • Mid-UTF-8 Termination: If the input ends in the middle of a multi-byte UTF-8 character (e.g., nText=1 with pText containing 0xC2, the lead byte of a 2-byte sequence), the decode loop itself stops at zTerm, but the tokenizer’s main loop then re-enters READ_UTF8 with zIn == zTerm, triggering the same out-of-bounds read as the empty-input case.

The zIn != zTerm check used during multi-byte decoding compounds the problem. It is not that the loop misbehaves when it starts exactly at zTerm (in that case != correctly stops it); rather, once the unconditional first read has been performed with zIn == zTerm, zIn sits at zTerm + 1, and from then on zIn != zTerm is always true. The decode loop will therefore keep consuming any adjacent bytes that happen to look like UTF-8 continuation bytes (0x80–0xBF). A < comparison at least halts the loop once zIn has passed zTerm, bounding the damage to the single unconditional read.
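The overshoot is easiest to see in a small simulation. The program below is a hypothetical demonstration, not SQLite code: it places the "adjacent memory" inside the same array so its behavior is well-defined, then replays the pre-fix read sequence for an nText=0 input:

#include <stdio.h>

int main(void){
  /* mem[0] onward plays the role of whatever happens to follow the
  ** input buffer in memory. nText = 0, so zTerm == pText. */
  unsigned char mem[4] = { 0xC2, 0x80, 0x80, 0x00 };
  const unsigned char *pText = mem;
  const unsigned char *zTerm = pText + 0;   /* nText = 0 */
  const unsigned char *zIn = pText;

  /* Pre-fix sequence: unconditional first read, then the != guard. */
  unsigned int c = *(zIn++);                /* overread #1: zIn > zTerm */
  if( c >= 0xC0 ){
    while( zIn != zTerm && (*zIn & 0xC0) == 0x80 ){
      zIn++;                                /* overreads #2, #3, ... */
    }
  }
  printf("read %d byte(s) past the end of the input\n",
         (int)(zIn - zTerm));               /* prints 3 */
  return 0;
}

Swapping != for < in the while condition stops the loop immediately in this scenario, leaving only the single unconditional read, which the caller-side guard described below eliminates.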

2. Lack of Prevalidation in fts5TriTokenize

The fts5TriTokenize function’s main loop does not verify whether the input buffer is exhausted before invoking READ_UTF8. This oversight is most apparent when processing empty strings: the loop proceeds to decode characters even when nText=0. The tokenizer should first check if zIn < zTerm before attempting to read any bytes. Without this guard clause, the macro will dereference zIn even when the buffer is empty or fully consumed.

3. Reliance on Null-Terminators for Safety

While the xTokenize API permits non-null-terminated input, fts5TriTokenize implicitly relies on encountering a null byte (\0) to halt parsing. This creates a contradiction: the function is expected to respect nText as the authoritative buffer length but may read beyond nText bytes if no null terminator exists. This behavior violates the principle of buffer safety and introduces dependencies on external memory layout.
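The two scanning disciplines are easy to contrast in isolation. This illustrative helper pair (not from SQLite) shows why a length-bounded loop is the only safe option when a terminator is not guaranteed:

#include <stddef.h>

/* Terminator-based scan: undefined behavior if no '\0' exists within
** the buffer - the loop walks into adjacent memory. */
size_t unsafeLen(const char *z){
  size_t n = 0;
  while( *z++ ) n++;
  return n;
}

/* Length-bounded scan: nText is authoritative; no byte at or beyond
** z + nText is ever read, terminator or not. */
size_t safeLen(const char *z, size_t nText){
  const char *zTerm = z + nText;
  size_t n = 0;
  while( z < zTerm && *z ){ z++; n++; }
  return n;
}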


Resolving the Overread: Code Fixes and Best Practices

1. Correcting READ_UTF8’s Boundary Checks

The first fix involves modifying the READ_UTF8 macro to use < instead of != when comparing zIn and zTerm:

#define READ_UTF8(zIn, zTerm, c)               \
  c = *(zIn++);                                \
  if( c >= 0xc0 ){                             \
    c = sqlite3Utf8Trans1[c - 0xc0];           \
    while( zIn < zTerm && (*zIn & 0xc0) == 0x80 ){ \
      c = (c << 6) + (0x3f & *(zIn++));        \
    }                                          \
    /* ... error handling ... */               \
  }

This change ensures that the loop terminates when zIn reaches or exceeds zTerm, preventing overreads during multi-byte decoding. However, this alone does not address the initial unconditional read of *(zIn++).

2. Adding Prevalidation in fts5TriTokenize

To prevent the initial overread, fts5TriTokenize must check zIn < zTerm before invoking READ_UTF8. The corrected loop structure looks like this:

while( zIn < zTerm ){
  READ_UTF8(zIn, zTerm, c);
  // ... tokenization logic ...
}

For empty input (zIn == zTerm), the loop is never entered, and READ_UTF8 is not called. This ensures that no bytes are read when nText=0.
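In the trigram tokenizer itself the guard has to appear before every character read, including the loop that fills the initial three-character window. The fragment below follows the shape of the fixed code; identifiers such as aBuf, aStart, zEof, p->bFold, and WRITE_UTF8 follow SQLite’s conventions, but treat this as an illustration rather than the verbatim source:

/* Populate aBuf[] with the characters of the first trigram,
** checking for end-of-input before each read. */
for(ii=0; ii<3; ii++){
  do{
    aStart[ii] = zIn - (const unsigned char*)pText;
    if( zIn >= zEof ) return SQLITE_OK;   /* input exhausted - no token */
    READ_UTF8(zIn, zEof, iCode);
    if( p->bFold ) iCode = sqlite3Fts5UnicodeFold(iCode, p->iFoldParam);
  }while( iCode==0 );
  WRITE_UTF8(zOut, iCode);
}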

3. Comprehensive Test Cases

To validate the fixes, developers should test edge cases such as:

  • Empty Strings: nText=0, pText pointing to non-null memory.
  • Incomplete UTF-8 Sequences: Strings ending with partial multi-byte characters (e.g., 0xC2 followed by no bytes).
  • No Null Terminator: Buffers with exact lengths (e.g., nText=3, pText="abc" with no trailing \0).

These tests ensure that the tokenizer respects nText and does not access memory beyond pText + nText.
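A standalone harness can exercise these cases directly through the public FTS5 API. The following is a minimal sketch, assuming a build of SQLite with FTS5 enabled (the trigram tokenizer requires SQLite 3.34.0 or later) and the fts5.h header on the include path; run it under AddressSanitizer or Valgrind so that any read past the allocated buffers is reported:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "sqlite3.h"
#include "fts5.h"

/* Token callback: print each token emitted by the tokenizer. */
static int xTokenCb(void *pCtx, int tflags, const char *pToken,
                    int nToken, int iStart, int iEnd){
  (void)pCtx; (void)tflags; (void)iStart; (void)iEnd;
  printf("token: %.*s\n", nToken, pToken);
  return SQLITE_OK;
}

int main(void){
  sqlite3 *db = 0;
  sqlite3_stmt *pStmt = 0;
  fts5_api *pApi = 0;
  fts5_tokenizer tok;
  void *pTokCtx = 0;
  Fts5Tokenizer *pTok = 0;

  if( sqlite3_open(":memory:", &db) ) return 1;

  /* Standard dance to obtain the fts5_api pointer. */
  sqlite3_prepare_v2(db, "SELECT fts5(?1)", -1, &pStmt, 0);
  sqlite3_bind_pointer(pStmt, 1, (void*)&pApi, "fts5_api_ptr", 0);
  sqlite3_step(pStmt);
  sqlite3_finalize(pStmt);
  if( pApi==0 ) return 1;

  if( pApi->xFindTokenizer(pApi, "trigram", &pTokCtx, &tok) ) return 1;
  if( tok.xCreate(pTokCtx, 0, 0, &pTok) ) return 1;

  /* Case 1: empty string - nText=0, no usable bytes, no terminator. */
  char *zEmpty = malloc(0);   /* ASan flags any read of this block */
  tok.xTokenize(pTok, 0, FTS5_TOKENIZE_DOCUMENT, zEmpty, 0, xTokenCb);

  /* Case 2: incomplete UTF-8 sequence - a lone 0xC2 lead byte. */
  char *zPartial = malloc(1);
  zPartial[0] = (char)0xC2;
  tok.xTokenize(pTok, 0, FTS5_TOKENIZE_DOCUMENT, zPartial, 1, xTokenCb);

  /* Case 3: exact-length buffer with no trailing '\0'. */
  char *zAbc = malloc(3);
  memcpy(zAbc, "abc", 3);
  tok.xTokenize(pTok, 0, FTS5_TOKENIZE_DOCUMENT, zAbc, 3, xTokenCb);

  tok.xDelete(pTok);
  free(zEmpty); free(zPartial); free(zAbc);
  sqlite3_close(db);
  return 0;
}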

4. Implications for Custom Tokenizers

Developers implementing custom tokenizers should adopt similar safeguards (a minimal skeleton follows this list):

  • Always validate zIn < zTerm before reading bytes.
  • Avoid assumptions about null terminators.
  • Use well-tested Unicode helpers for character classification and folding (internally, SQLite uses functions such as sqlite3Fts5UnicodeIsAlnum) rather than ad-hoc character handling.
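The skeleton below is a hypothetical whitespace tokenizer, not SQLite code, included only to show the guard pattern in an xTokenize implementation. The key invariant is that every dereference of zIn is preceded by a zIn < zTerm check, and the '\0' byte plays no role in loop termination:

#include "sqlite3.h"
#include "fts5.h"

/* Minimal sketch of a custom xTokenize that treats nText as
** authoritative. Splits on single spaces for brevity. */
static int myTokenize(
  Fts5Tokenizer *pTok, void *pCtx, int flags,
  const char *pText, int nText,
  int (*xToken)(void*, int, const char*, int, int, int)
){
  const unsigned char *zIn = (const unsigned char*)pText;
  const unsigned char *zTerm = zIn + nText;   /* never read at/past this */
  (void)pTok; (void)flags;
  while( zIn < zTerm ){
    const unsigned char *zStart = zIn;
    /* Consume one token's worth of bytes, re-checking the bound
    ** before every read; never scan for '\0'. */
    while( zIn < zTerm && *zIn != ' ' ) zIn++;
    if( zIn > zStart ){
      int rc = xToken(pCtx, 0, (const char*)zStart,
                      (int)(zIn - zStart),
                      (int)(zStart - (const unsigned char*)pText),
                      (int)(zIn - (const unsigned char*)pText));
      if( rc != SQLITE_OK ) return rc;
    }
    if( zIn < zTerm ) zIn++;                  /* skip the separator */
  }
  return SQLITE_OK;
}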

By addressing the boundary checks in both READ_UTF8 and fts5TriTokenize, SQLite ensures compliance with the xTokenize API’s contract and eliminates risks associated with non-null-terminated inputs. These fixes underscore the importance of rigorous buffer management in low-level text processing functions, particularly when handling variable-width encodings like UTF-8.
