FTS5 Phrase Queries with Custom Tokenizers and Escaping Reserved Characters

Issue Overview: FTS5 PhTS5 Phrase Matching Fails with Spaces and Reserved Characters in Custom Tokenizer Configuration

The core issue revolves around configuring SQLite’s FTS5 virtual table to tokenize text using commas as separators while attempting to execute MATCH queries that include spaces and reserved characters (e.g., ;). The problem manifests in two distinct but related scenarios:

  1. Tokenization Behavior with Spaces:
    When using a custom tokenizer that designates commas (',') as the sole separators, spaces are no longer treated as token boundaries. This causes standard MATCH 'hi mom' queries to fail because the default FTS5 parser interprets hi and mom as separate tokens. Since spaces are now part of tokens (not separators), unquoted searches for hi mom look for rows containing both tokens hi and mom, which do not exist if the indexed text is stored as a single token like hi mom. Wrapping the search term in double quotes ("hi mom") forces FTS5 to treat the entire string as a single phrase, enabling a match. However, users often expect unquoted terms to work intuitively, leading to confusion.

  2. Escaping Reserved Characters in Phrase Queries:
    Reserved characters like semicolons (;) must be escaped in FTS5 queries. Escaping conflicts arise when combining reserved characters with double-quoted phrases. For example, a search for has ";" space requires escaping the semicolon while also delimiting the phrase with double quotes. Misplaced or insufficient escaping leads to syntax errors (e.g., '"has ";" space"') or failed matches (e.g., 'has ";" space'). The root complication lies in SQLite’s escaping rules for FTS5 phrases, where double quotes inside phrases must be doubled (SQL-style escaping), and reserved characters require careful handling depending on their position within tokens.

Possible Causes: Custom Tokenizer Misconfiguration and Escaping Rule Ambiguity

Cause 1: Tokenizer Separators Override Default Whitespace Handling

The CREATE VIRTUAL TABLE statement configures the tokenizer using unicode61 with custom separators (','). This configuration explicitly removes whitespace (including spaces) from the list of token separators. Consequently:

  • Text like "hi mom" is stored as a single token because the tokenizer splits only on commas.
  • Unquoted MATCH 'hi mom' searches for rows containing both hi and mom as separate tokens, which do not exist in the indexed data.
  • Double-quoted MATCH '"hi mom"' searches for the exact phrase hi mom, which matches the stored token.

This behavior is consistent with FTS5’s tokenization rules but counterintuitive for users expecting spaces to function as implicit separators. The confusion stems from conflating FTS5’s phrase-matching syntax with its tokenization logic.

Cause 2: Improper Escaping of Double Quotes and Reserved Characters

FTS5 reserves certain characters (", ', \, `, ^, ~, *, ?, :, !, $, &, |, #, ;, ,, (,
), [, ], {, }) for query syntax. To include these in search terms:

  • Double quotes (") within phrases must be escaped by doubling them (e.g., "a ""quoted"" term").
  • Reserved characters like ; do not require escaping unless they appear in positions that conflict with query operators (e.g., proximity search syntax).

In the example 'has ";" space':

  • The unquoted semicolon is treated as part of the token ";", but the lack of phrase delimiters causes FTS5 to parse has, ;, and space as separate tokens. If the indexed text contains has ";" space as a single token (due to comma-based tokenization), this query fails because FTS5 looks for three distinct tokens.
  • Wrapping the entire term in double quotes without proper escaping ('"has ";" space"') results in a syntax error because the inner " are unescaped, breaking the phrase boundary.

Cause 3: Ambiguity Between Token Content and Query Syntax

FTS5’s query parser uses reserved characters for advanced features like proximity operators (NEAR), Boolean logic (AND, OR), and column filters. When reserved characters appear in indexed text, they must either be:

  • Escaped in queries if they conflict with query syntax.
  • Omitted from tokenization via the tokenizer configuration.

The example 'no";"space' works because the semicolon is enclosed in quotes and treated as part of the token ";". However, 'has ";" space' fails because the parser interprets the semicolon as a reserved character outside a quoted phrase, leading to a syntax error or unintended tokenization.

Troubleshooting Steps, Solutions & Fixes: Configuring Tokenizers and Escaping Queries Correctly

Step 1: Validate Tokenization Behavior

Before addressing escaping issues, verify how the custom tokenizer processes input text. Use the fts5vocab virtual table to inspect tokens:

-- Create a vocabulary table for my_fts
CREATE VIRTUAL TABLE my_fts_vocab USING fts5vocab(my_fts, 'row');

-- Insert sample data
INSERT INTO my_fts (id, text) VALUES (1, 'hi mom,hello;world');
INSERT INTO my_fts (id, text) VALUES (2, 'has ";" space');

-- Query tokens
SELECT term, doc FROM my_fts_vocab WHERE col = 'text';

Expected output (assuming comma separators):

termdoc
hi mom1
hello;world1
has ";" space2

If spaces are preserved in tokens (e.g., hi mom), unquoted MATCH 'hi mom' will fail because FTS5 looks for separate hi and mom tokens. Only phrase queries (MATCH '"hi mom"') will match.

Solution:
Educate users to always use double quotes for multi-word terms when spaces are not token separators. Alternatively, reconfigure the tokenizer to include spaces as separators:

CREATE VIRTUAL TABLE my_fts USING fts5(
  id, 
  text,
  tokenize = "unicode61 separators ' ,'"
);

Here, the separators parameter includes both space and comma, allowing MATCH 'hi mom' to find rows where hi and mom are separated by any delimiter.

Step 2: Escape Double Quotes in Phrase Queries

To include double quotes within a phrase, escape them by doubling:

-- Search for 'has ";" space'
SELECT * FROM my_fts WHERE text MATCH '"has "";"" space"';

This query:

  • Uses outer double quotes to delimit the phrase.
  • Escapes inner " as "".
  • Treats ";" as part of the token.

Common Pitfalls:

  • Mixing single and double quotes: '"has ";" space"' is invalid because the outer single quotes are SQL string delimiters, and the inner " are unescaped.
  • Forgetting to escape inner quotes: "has ";" space" (unescaped) breaks the phrase boundary after has .

Step 3: Handle Reserved Characters in Queries

Reserved characters like ; do not require escaping unless they conflict with query syntax. For example:

-- Search for 'no";"space'
SELECT * FROM my_fts WHERE text MATCH 'no";"space';

This works because the semicolon is part of the token ";" and does not interfere with the query parser. However, if the semicolon appears outside a quoted token, it may cause issues:

-- Problematic: Semicolon as a query operator
SELECT * FROM my_fts WHERE text MATCH 'has ; space';

Here, ; is interpreted as a query operator (unused in FTS5), leading to a parse error. To include reserved characters in unquoted terms, use:

-- Escape reserved characters with backslash (if enabled)
SELECT * FROM my_fts WHERE text MATCH 'has \; space';

Note: By default, FTS5 does not support backslash escaping. To enable it, configure the tokenizer with remove_diacritics 0:

CREATE VIRTUAL TABLE my_fts USING fts5(
  id,
  text,
  tokenize = "unicode61 separators ' ,' tokenchars ';'"
);

The tokenchars parameter specifies that ; should be treated as part of tokens, not separators or operators.

Step 4: Use Alternative Delimiters for Phrases

For scenarios where double quotes conflict with content, use column filters or NEAR operators:

-- Search for 'has ";" space' using column filter
SELECT * FROM my_fts WHERE text MATCH 'text:"has "";"" space"';

This explicitly scopes the phrase to the text column, reducing ambiguity.

Step 5: Reconfigure Tokenizer for Complex Characters

If reserved characters are prevalent in data, adjust the tokenizer to include them as token characters:

CREATE VIRTUAL TABLE my_fts USING fts5(
  id,
  text,
  tokenize = "unicode61 tokenchars ';' separators ','"
);

This configuration:

  • Treats ; as part of tokens.
  • Uses commas as separators.
  • Allows MATCH '";"' to find tokens containing semicolons.

Final Solution Summary

  1. For spaces in phrases: Always use double quotes around multi-word terms when spaces are not token separators.
  2. For reserved characters:
    • Enclose phrases in double quotes.
    • Escape inner double quotes by doubling them ("").
    • Use tokenchars to include reserved characters in tokens.
  3. For ambiguous queries: Use column filters (text:"phrase") or reconfigure the tokenizer to exclude problematic characters as separators.

By aligning tokenizer settings with query syntax requirements and rigorously escaping phrases, users can reliably search for terms containing spaces and reserved characters in FTS5.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *