Handling User Input Sanitization and Syntax Errors in SQLite FTS5 Queries
Understanding FTS5 Query Syntax and Input Sanitization Challenges
SQLite’s FTS5 extension provides powerful full-text search capabilities but introduces unique challenges when handling user-supplied search terms. The core issue revolves around the interaction between FTS5’s specialized query syntax and the need to safely incorporate arbitrary user input into search patterns. Unlike standard SQL queries where parameter binding provides robust protection against injection attacks, FTS5 MATCH clauses require careful handling of special characters and syntax operators.
FTS5 interprets search patterns using a specific grammar where spaces act as implicit AND operators, double quotes define phrase matches, and certain characters like apostrophes, commas, and operators (AND/OR/NEAR) carry special meanings. For example, a search term like API's
contains an apostrophe that FTS5 interprets as a phrase boundary if not properly escaped, while context-free
contains a hyphen that FTS5 treats as a token separator. The challenge intensifies when users combine these elements with intentional operators – e.g., "machine learning" OR "deep learning"
requires preserving the OR operator while properly escaping the quotes around phrases.
A critical implementation detail is FTS5’s handling of double quotes versus single quotes. While SQLite string literals typically use single quotes, FTS5 phrase matching requires double quotes. This creates a dual-layer escaping problem: first for SQL string literals, then for FTS5 syntax. For instance, searching for O'Neil
requires converting the apostrophe to two single quotes for SQL (O''Neil
), then wrapping in double quotes for FTS5 ("O''Neil"
). Failure to handle both layers correctly leads to syntax errors or unintended search behavior.
The original example demonstrates this complexity:
WHERE job_fts MATCH 'Python' -- Works
WHERE job_fts MATCH 'API's' -- Fails due to unescaped apostrophe
Proper sanitization requires transforming the raw input API's
into "API''s"
– doubling the apostrophe for SQL escaping and wrapping in quotes for FTS5 phrase recognition. However, this approach must be systematically applied to all special characters while preserving valid search operators when desired.
Common Pitfalls in FTS5 Query Construction and Sanitization
Three primary failure modes emerge when handling user input for FTS5 queries:
Incomplete Character Escaping: FTS5 recognizes 20+ special characters including
"
,'
,^
,@
,{
,}
,[
,]
,,
, and.
, each requiring specific handling. A blacklist-based approach like the initialsanitize_input
function risks missing critical characters. For example, the caret (^
) acts as a column filter operator in FTS5 (column:term
), so unescaped user input containingprice^10
could unexpectedly target theprice
column.Operator Collision: Users might intentionally or accidentally include FTS5 operators like AND/OR/NOT/NEAR. The query
python OR sql
is valid FTS5 syntax, but naive sanitization that wraps the entire string in quotes ("python OR sql"
) converts it into a phrase search for the literal string "python OR sql" instead of matching documents containing either term. Conversely, failing to handle intentional operators when desired leads to broken advanced searches.Phrase Boundary Conflicts: User input containing natural language phrases often includes characters that conflict with FTS5’s phrase syntax. A search for
C# developer
requires handling the#
symbol (which FTS5 interprets as a token separator) while preserving it as part of the token. Improper handling could either splitC#
into separate termsC
and#
or cause syntax errors if not properly quoted.
The interaction between SQL string escaping and FTS5 syntax escaping creates particularly subtle bugs. Consider a search term containing both apostrophes and quotes: Can't handle "this"
. Proper sanitization requires:
- Escaping single quotes for SQL:
Can''t handle "this"
- Escaping double quotes for FTS5:
"Can''t handle ""this"""
Layering these transformations incorrectly (e.g., escaping quotes before apostrophes) leads to malformed queries. This dual escaping requirement explains why parameter binding alone doesn’t solve the problem – while it prevents SQL injection, it doesn’t address FTS5 syntax issues. Parameter binding passes the raw string to FTS5, which then interprets any special characters in its own syntax context.
Strategies for Secure and Robust FTS5 Query Handling
Step 1: Establish Input Processing Layers
Implement a three-stage sanitization pipeline:
Syntax Validation: Detect and handle FTS5 operators based on application requirements. Decide whether to support advanced operators (OR/AND/NOT) or treat them as literal terms. For example, you might allow operators only in "expert mode" searches while automatically quoting them in basic searches.
Character Normalization: Convert Unicode quotes to ASCII equivalents, normalize whitespace, and handle control characters. Use Python’s
unicodedata.normalize()
combined with regex substitution:import unicodedata import re def normalize_input(text): text = unicodedata.normalize('NFKC', text) # Convert fancy quotes to ASCII text = re.sub(r'[\x00-\x1F]', ' ', text) # Strip control characters return text.strip()
FTS5-Specific Escaping: Apply context-aware escaping for FTS5 syntax. Instead of blacklisting characters, use a whitelist approach that splits terms and wraps each in quotes:
def fts5_escape(term): # Handle wildcards only at term end term = term.rstrip('*') wildcard = '*' if '*' in term else '' # Escape FTS5 special characters: " \ and double existing quotes term = term.replace('"', '""').replace('\\', '\\\\') return f'"{term}"{wildcard}' def build_fts_query(user_input): terms = user_input.split() return ' '.join(fts5_escape(term) for term in terms)
This transforms
API's C#
into"API''s" "C#"
, properly escaping quotes for SQL and FTS5 while preserving the#
symbol.
Step 2: Implement Retry Logic with Fallback Sanitization
As suggested in the discussion, attempt the user’s raw input first to preserve intentional operators, then fall back to sanitized terms if it fails:
def execute_fts_query(query, retry=True):
try:
cursor.execute('SELECT ... WHERE job_fts MATCH ?', (query,))
return cursor.fetchall()
except sqlite3.OperationalError as e:
if 'syntax error' in str(e) and retry:
sanitized = build_fts_query(query)
return execute_fts_query(sanitized, retry=False)
raise
# Usage:
results = execute_fts_query('python OR sql') # First attempt with raw input
This approach honors valid FTS5 syntax while automatically handling malformed queries. Log the original input and sanitized version for debugging.
Step 3: Context-Aware Term Splitting and Operator Handling
For applications allowing advanced searches, parse the query to distinguish intentional operators from literal terms:
import re
def parse_fts_query(text):
# Tokenize preserving quoted phrases and operators
tokens = re.findall(r'("[^"]*"|\S+|\b(?:AND|OR|NOT|NEAR)\b)', text, re.IGNORECASE)
cleaned = []
for token in tokens:
token = token.strip()
if token.upper() in {'AND', 'OR', 'NOT', 'NEAR'}:
cleaned.append(token.upper())
else:
if not token.startswith('"'):
token = fts5_escape(token)
cleaned.append(token)
return ' '.join(cleaned)
This regex-based tokenizer handles:
"quoted phrases"
as single tokens- Uppercased AND/OR/NOT as operators
- Other terms as individual quoted phrases
For input: python OR "API's"
, it produces "python" OR "API''s"
, preserving the operator while escaping the apostrophe.
Step 4: Comprehensive Character Handling Matrix
Create a translation table for all FTS5-significant characters, defining how each should be handled:
Character | Handling |
---|---|
" | Escape as "" within terms, preserve as phrase boundaries |
‘ | Escape as ” for SQL, preserve within FTS5 terms |
* | Allow only at term end as wildcard, strip elsewhere |
: | Escape unless used in column filter (e.g., title:term) |
– | Preserve as term character (e.g., "context-free") |
~ | Escape if not used as NEAR operator (NEAR/~) |
Implement this with a regex substitution chain:
def sanitize_fts_term(term):
# Handle wildcards
term, wildcard = re.subn(r'(?<!\.)\*+$', '*', term)
wildcard = '*' if wildcard else ''
# Escape internal quotes and backslashes
term = term.translate(str.maketrans({'"': '""', '\\': '\\\\'}))
# Remove other operators unless part of allowed syntax
term = re.sub(r'[^"\w\s\-~:]', ' ', term)
return f'"{term}"{wildcard}'
Step 5: Integration with Parameter Binding
Contrary to initial assumptions, SQLite parameters can be used with FTS5 MATCH clauses when properly escaped:
# Python example using sqlite3 module
query = """
SELECT bm25(job_fts) as score,
highlight(job_fts, 1, '<b>', '</b>')
FROM job_fts
WHERE job_fts MATCH ?
ORDER BY score
"""
user_input = 'python OR "API''s"' # Properly escaped input
cursor.execute(query, (user_input,))
Parameter binding prevents SQL injection but does not escape FTS5 syntax. The input must still be formatted according to FTS5 rules. This separation of concerns means:
- Use parameter binding for SQL injection protection
- Preprocess input for FTS5 syntax validity
Step 6: Comprehensive Testing Matrix
Validate your implementation against edge cases:
Input | Expected FTS5 Query | Notes |
---|---|---|
API's | "API''s" | Escaped apostrophe in phrase |
C# | "C#" | Preserved symbol |
AND OR NOT | "AND" "OR" "NOT" | Operators treated as terms |
python NEAR/5 sql | "python" NEAR/5 "sql" | Preserved operator with proximity |
file:C:\path | file:"C:\\path" | Escaped backslash for column filter |
mal"icious | "mal""icious" | Escaped quote within term |
Automate these tests using your database layer to catch regressions.
Final Implementation Checklist
- Input Normalization: Remove control characters, normalize Unicode
- Syntax Analysis: Detect and handle FTS5 operators based on policy
- Term Splitting: Split on whitespace, preserve quoted phrases
- Character Escaping: Layer SQL and FTS5 escaping appropriately
- Parameter Binding: Always use ? placeholders for MATCH clauses
- Error Handling: Retry with sanitized input after syntax errors
- Wildcard Management: Restrict * to term ends if supporting prefix searches
- Column Filter Sanitization: Validate column names in
column:term
patterns
By systematically applying these strategies, you create a robust FTS5 query handler that balances security with search flexibility. The key insight is recognizing that FTS5 sanitization operates at a higher layer than SQL injection prevention – both must be addressed, but with different techniques tailored to each layer’s requirements.