FTS5 Phrase Queries with Custom Tokenizers and Escaping Reserved Characters
Issue Overview: FTS5 PhTS5 Phrase Matching Fails with Spaces and Reserved Characters in Custom Tokenizer Configuration
The core issue revolves around configuring SQLite’s FTS5 virtual table to tokenize text using commas as separators while attempting to execute MATCH
queries that include spaces and reserved characters (e.g., ;
). The problem manifests in two distinct but related scenarios:
Tokenization Behavior with Spaces:
When using a custom tokenizer that designates commas (','
) as the sole separators, spaces are no longer treated as token boundaries. This causes standardMATCH 'hi mom'
queries to fail because the default FTS5 parser interpretshi
andmom
as separate tokens. Since spaces are now part of tokens (not separators), unquoted searches forhi mom
look for rows containing both tokenshi
andmom
, which do not exist if the indexed text is stored as a single token likehi mom
. Wrapping the search term in double quotes ("hi mom"
) forces FTS5 to treat the entire string as a single phrase, enabling a match. However, users often expect unquoted terms to work intuitively, leading to confusion.Escaping Reserved Characters in Phrase Queries:
Reserved characters like semicolons (;
) must be escaped in FTS5 queries. Escaping conflicts arise when combining reserved characters with double-quoted phrases. For example, a search forhas ";" space
requires escaping the semicolon while also delimiting the phrase with double quotes. Misplaced or insufficient escaping leads to syntax errors (e.g.,'"has ";" space"'
) or failed matches (e.g.,'has ";" space'
). The root complication lies in SQLite’s escaping rules for FTS5 phrases, where double quotes inside phrases must be doubled (SQL-style escaping), and reserved characters require careful handling depending on their position within tokens.
Possible Causes: Custom Tokenizer Misconfiguration and Escaping Rule Ambiguity
Cause 1: Tokenizer Separators Override Default Whitespace Handling
The CREATE VIRTUAL TABLE
statement configures the tokenizer using unicode61
with custom separators (','
). This configuration explicitly removes whitespace (including spaces) from the list of token separators. Consequently:
- Text like
"hi mom"
is stored as a single token because the tokenizer splits only on commas. - Unquoted
MATCH 'hi mom'
searches for rows containing bothhi
andmom
as separate tokens, which do not exist in the indexed data. - Double-quoted
MATCH '"hi mom"'
searches for the exact phrasehi mom
, which matches the stored token.
This behavior is consistent with FTS5’s tokenization rules but counterintuitive for users expecting spaces to function as implicit separators. The confusion stems from conflating FTS5’s phrase-matching syntax with its tokenization logic.
Cause 2: Improper Escaping of Double Quotes and Reserved Characters
FTS5 reserves certain characters ("
, '
, \
, `
, ^
, ~
, *
, ?
, :
, !
, $
, &
, |
, #
, ;
, ,
, (
,
)
, [
, ]
, {
, }
) for query syntax. To include these in search terms:
- Double quotes (
"
) within phrases must be escaped by doubling them (e.g.,"a ""quoted"" term"
). - Reserved characters like
;
do not require escaping unless they appear in positions that conflict with query operators (e.g., proximity search syntax).
In the example 'has ";" space'
:
- The unquoted semicolon is treated as part of the token
";"
, but the lack of phrase delimiters causes FTS5 to parsehas
,;
, andspace
as separate tokens. If the indexed text containshas ";" space
as a single token (due to comma-based tokenization), this query fails because FTS5 looks for three distinct tokens. - Wrapping the entire term in double quotes without proper escaping (
'"has ";" space"'
) results in a syntax error because the inner"
are unescaped, breaking the phrase boundary.
Cause 3: Ambiguity Between Token Content and Query Syntax
FTS5’s query parser uses reserved characters for advanced features like proximity operators (NEAR
), Boolean logic (AND
, OR
), and column filters. When reserved characters appear in indexed text, they must either be:
- Escaped in queries if they conflict with query syntax.
- Omitted from tokenization via the tokenizer configuration.
The example 'no";"space'
works because the semicolon is enclosed in quotes and treated as part of the token ";"
. However, 'has ";" space'
fails because the parser interprets the semicolon as a reserved character outside a quoted phrase, leading to a syntax error or unintended tokenization.
Troubleshooting Steps, Solutions & Fixes: Configuring Tokenizers and Escaping Queries Correctly
Step 1: Validate Tokenization Behavior
Before addressing escaping issues, verify how the custom tokenizer processes input text. Use the fts5vocab
virtual table to inspect tokens:
-- Create a vocabulary table for my_fts
CREATE VIRTUAL TABLE my_fts_vocab USING fts5vocab(my_fts, 'row');
-- Insert sample data
INSERT INTO my_fts (id, text) VALUES (1, 'hi mom,hello;world');
INSERT INTO my_fts (id, text) VALUES (2, 'has ";" space');
-- Query tokens
SELECT term, doc FROM my_fts_vocab WHERE col = 'text';
Expected output (assuming comma separators):
term | doc |
---|---|
hi mom | 1 |
hello;world | 1 |
has ";" space | 2 |
If spaces are preserved in tokens (e.g., hi mom
), unquoted MATCH 'hi mom'
will fail because FTS5 looks for separate hi
and mom
tokens. Only phrase queries (MATCH '"hi mom"'
) will match.
Solution:
Educate users to always use double quotes for multi-word terms when spaces are not token separators. Alternatively, reconfigure the tokenizer to include spaces as separators:
CREATE VIRTUAL TABLE my_fts USING fts5(
id,
text,
tokenize = "unicode61 separators ' ,'"
);
Here, the separators
parameter includes both space and comma, allowing MATCH 'hi mom'
to find rows where hi
and mom
are separated by any delimiter.
Step 2: Escape Double Quotes in Phrase Queries
To include double quotes within a phrase, escape them by doubling:
-- Search for 'has ";" space'
SELECT * FROM my_fts WHERE text MATCH '"has "";"" space"';
This query:
- Uses outer double quotes to delimit the phrase.
- Escapes inner
"
as""
. - Treats
";"
as part of the token.
Common Pitfalls:
- Mixing single and double quotes:
'"has ";" space"'
is invalid because the outer single quotes are SQL string delimiters, and the inner"
are unescaped. - Forgetting to escape inner quotes:
"has ";" space"
(unescaped) breaks the phrase boundary afterhas
.
Step 3: Handle Reserved Characters in Queries
Reserved characters like ;
do not require escaping unless they conflict with query syntax. For example:
-- Search for 'no";"space'
SELECT * FROM my_fts WHERE text MATCH 'no";"space';
This works because the semicolon is part of the token ";"
and does not interfere with the query parser. However, if the semicolon appears outside a quoted token, it may cause issues:
-- Problematic: Semicolon as a query operator
SELECT * FROM my_fts WHERE text MATCH 'has ; space';
Here, ;
is interpreted as a query operator (unused in FTS5), leading to a parse error. To include reserved characters in unquoted terms, use:
-- Escape reserved characters with backslash (if enabled)
SELECT * FROM my_fts WHERE text MATCH 'has \; space';
Note: By default, FTS5 does not support backslash escaping. To enable it, configure the tokenizer with remove_diacritics 0
:
CREATE VIRTUAL TABLE my_fts USING fts5(
id,
text,
tokenize = "unicode61 separators ' ,' tokenchars ';'"
);
The tokenchars
parameter specifies that ;
should be treated as part of tokens, not separators or operators.
Step 4: Use Alternative Delimiters for Phrases
For scenarios where double quotes conflict with content, use column filters or NEAR operators:
-- Search for 'has ";" space' using column filter
SELECT * FROM my_fts WHERE text MATCH 'text:"has "";"" space"';
This explicitly scopes the phrase to the text
column, reducing ambiguity.
Step 5: Reconfigure Tokenizer for Complex Characters
If reserved characters are prevalent in data, adjust the tokenizer to include them as token characters:
CREATE VIRTUAL TABLE my_fts USING fts5(
id,
text,
tokenize = "unicode61 tokenchars ';' separators ','"
);
This configuration:
- Treats
;
as part of tokens. - Uses commas as separators.
- Allows
MATCH '";"'
to find tokens containing semicolons.
Final Solution Summary
- For spaces in phrases: Always use double quotes around multi-word terms when spaces are not token separators.
- For reserved characters:
- Enclose phrases in double quotes.
- Escape inner double quotes by doubling them (
""
). - Use
tokenchars
to include reserved characters in tokens.
- For ambiguous queries: Use column filters (
text:"phrase"
) or reconfigure the tokenizer to exclude problematic characters as separators.
By aligning tokenizer settings with query syntax requirements and rigorously escaping phrases, users can reliably search for terms containing spaces and reserved characters in FTS5.