FTS5 NEAR Operator Phrase Requirements and Query Parsing Anomalies

NEAR Operator Syntax Validation, Integer Overflow, and Null Byte Handling in FTS5 Queries

Issue Overview: NEAR Operator Single Phrase Acceptance, Unchecked Overflow in NEAR Distance Parameter, and Premature Query Termination Due to Null Bytes

The FTS5 extension in SQLite provides powerful full-text search capabilities, but its implementation of the NEAR operator and query parsing exhibits three critical anomalies that compromise query integrity and predictability.

1. NEAR Operator Accepts Single Phrase Despite Requiring Two Phrases
The NEAR operator is explicitly defined in the FTS5 documentation to require at least two phrases as operands. For example, NEAR(phrase1 phrase2) or NEAR(phrase1 phrase2, N) where N is the maximum allowed distance between terms. However, the current implementation allows queries like NEAR(one) with a single phrase to execute without errors. This violates the documented syntax and introduces ambiguity. When a single phrase is provided, the operator behaves as though the query were simply the phrase itself, bypassing the intended proximity logic. This leniency can lead to unintended search results, especially when dynamically generating queries or parsing user input.

2. Unchecked Integer Overflow in NEAR Distance Parameter (N)
The NEAR operator optionally accepts an integer N to specify the maximum permitted distance between phrases. The implementation does not validate whether N is within the bounds of representable integer values. For instance, using an astronomically large value like 2^70 (1,180,591,620,717,411,303,424) is accepted without error. While SQLite internally represents integers as 64-bit signed values, the FTS5 extension does not enforce this limit during query parsing. In practice, extremely large N values may lead to integer overflow or undefined behavior in the underlying proximity calculation logic. The fts5ExprNearIsMatch function contains an assert statement to check for at least two phrases, but this is only active in debug builds, leaving release builds vulnerable to silent failures.

3. Premature Query Termination Due to Null Bytes
FTS5 query parsing stops at the first null byte (\0) encountered in the input string, even if subsequent characters form valid syntax. For example, a query like 'one\0AND OR NOT' is interpreted as 'one', ignoring the remainder of the string. Replacing the null byte with a space ('one AND OR NOT') correctly triggers a syntax error due to the invalid combination of operators. This behavior is inconsistent with standard SQLite string handling, where null bytes in bound parameters are typically treated as part of the string unless explicitly terminated by the application. The lack of proper null byte handling can lead to truncated queries, security vulnerabilities, or unexpected matches when binary data is inadvertently mixed with text.

Possible Causes: Parser Leniency, Absence of Overflow Checks, and Null Byte Interpretation

1. Parser Leniency in NEAR Operator Syntax
The FTS5 query parser’s handling of the NEAR operator is designed to prioritize flexibility over strict syntax validation. This approach allows partial or malformed queries to execute, assuming developers will adhere to documented syntax. The parser likely treats NEAR(one) as equivalent to one by discarding the operator structure when insufficient phrases are detected. This leniency stems from the parser’s tokenization logic, which does not enforce operand count requirements for operators.

2. Integer Representation and Overflow Mitigation Gaps
The absence of overflow checks for the N parameter in NEAR queries arises from SQLite’s reliance on 64-bit signed integers for internal calculations. While SQLite itself gracefully handles large integers via dynamic typing, the FTS5 extension does not clamp or validate the N value during query parsing. The fts5ExprNearIsMatch function uses a signed integer to store N, but the parsing logic does not reject values exceeding 2^63 - 1. This oversight could lead to negative distances when overflow occurs, though the impact is mitigated by the function’s use of absolute values in practice.

3. Null Byte Handling in String Tokenization
The FTS5 query parser terminates tokenization at the first null byte due to its reliance on C-style string functions, which interpret \0 as the end of a string. This behavior is inherited from SQLite’s underlying tokenization logic, which does not account for embedded nulls in text. When queries are passed via bound parameters, the SQLite engine does not sanitize or escape null bytes, allowing them to persist into the FTS5 parser. The result is a mismatch between the expected query syntax and the truncated input processed by FTS5.

Troubleshooting Steps, Solutions, and Fixes: Enforcing Phrase Requirements, Validating Integer Parameters, and Sanitizing Null Bytes

1. Enforcing Phrase Requirements in NEAR Operator
To ensure NEAR operators contain at least two phrases, implement pre-validation in the application layer before executing queries:

  • Regular Expression Validation: Use a regex pattern to check the structure of NEAR clauses. For example:

    NEAR\((\s*"[^"]+"|\s*'\w+'|\s*\w+){2,}\s*(,\s*\d+)?\s*\)
    

    This pattern matches NEAR followed by at least two phrases (quoted or unquoted) and an optional distance parameter.

  • Query Parsing Libraries: Integrate a dedicated FTS5 query parser (e.g., Python’s sqlite-fts4 or a custom lexer) to validate syntax before passing queries to SQLite.

  • SQLite Runtime Checks: For inline validation, use a CHECK constraint in a virtual table or trigger, though this is less efficient.

2. Validating and Sanitizing NEAR Distance Parameters
To prevent integer overflow in the N parameter:

  • Range Clamping: Restrict N to a reasonable range (e.g., 1–1000) based on application requirements. Example in Python:

    def sanitize_n(n):
        return max(1, min(n, 1000))
    
  • Type and Range Checks: Reject non-integer values and values exceeding 2^63 - 1 (9,223,372,036,854,775,807) before query execution.

  • SQLite C Extension: Modify the FTS5 source to include overflow checks in fts5ExprNearParse (see ext/fts5/fts5_expr.c). Add:

    if( n > SQLITE_MAX_INT32 ){
        sqlite3Fts5ParseError(pParse, "NEAR distance exceeds maximum value");
        return;
    }
    

3. Mitigating Null Byte Interference in FTS5 Queries
To handle null bytes in queries:

  • Input Sanitization: Strip null bytes from user inputs before constructing queries. Example in Python:

    query = user_input.replace('\0', ' ')
    
  • Parameter Binding Best Practices: Use bound parameters for all user-supplied inputs to avoid injection attacks and ensure proper encoding.

  • FTS5 Parser Modification: Patch the FTS5 tokenizer to replace null bytes with spaces or raise an error. In fts5ExprTokenize (in fts5_expr.c), add:

    if( c==0 ){
        sqlite3Fts5ParseError(pParse, "null byte in FTS5 query");
        return SQLITE_ERROR;
    }
    

Final Recommendations

  • Documentation Updates: Clarify NEAR operator requirements and null byte behavior in the official FTS5 docs.
  • Testing Suites: Implement unit tests covering edge cases (single-phrase NEAR, large N values, null bytes).
  • Community Reporting: File an issue on the SQLite GitHub repository or mailing list to request parser fixes.

By addressing these issues through input validation, query sanitization, and targeted code modifications, developers can ensure FTS5 queries behave as documented and resist common parsing anomalies.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *