Resolving FTS5 Query Syntax, Tokenization, and Indexing Issues in Hybrid Search Systems
Unexpected Query Failures with Hyphenated Terms and OR Operators in FTS5
Core Observations
The user encountered inconsistent behavior when querying an FTS5 virtual table with hyphenated terms and logical operators. Specific issues include:
soft-cover
failing with a "no such column: cover" error."soft-cover"
succeeding despite the hyphen."well-ness"
failing to match the indexed term "wellness.""nourish OR soft-cover"
failing even though individual terms worked.
Root Causes
Hyphens as Column Filter Operators
FTS5 interprets unquoted hyphens (-
) as column exclusion operators. For example, soft-cover
is parsed as:
- Search for
soft
in all indexed columns except a (non-existent) column namedcover
.
Since the products
FTS5 table has no column named cover
, this triggers a "no such column" error. Quoting the term ("soft-cover"
) forces FTS5 to treat it as a single phrase, bypassing column filter interpretation.
Tokenization of Hyphenated Phrases
FTS5’s default tokenizer splits text on hyphens, underscores, and Unicode whitespace. The phrase "soft-cover"
is tokenized into soft
and cover
, which match the indexed title "SOFT COVER" (due to case insensitivity). However, "well-ness"
is tokenized into well
and ness
, which do not appear in the original title "WELLNESS PLANNER."
OR Operator Precedence and Syntax
The query nourish OR soft-cover
fails because:
soft-cover
is parsed assoft
(all columns exceptcover
).- The
OR
operator requires both operands to be valid. - The invalid column filter
cover
propagates the error.
Quoting the hyphenated term ("nourish OR soft-cover"
) does not resolve the issue because the entire quoted string is treated as a single phrase. Instead, the correct syntax is nourish OR "soft-cover"
, which separates the OR
operator from the quoted phrase.
Discrepancies in Tokenization Behavior for Hyphenated Search Phrases
Core Observations
"soft-cover"
matches "SOFT COVER" despite the absence of a hyphen in the original title."well-ness"
fails to match "WELLNESS PLANNER."
Root Causes
FTS5 Tokenization Rules
The default Unicode61 tokenizer performs the following actions:
- Case folding: All text is converted to lowercase.
- Diacritic removal: Accented characters are replaced with their base forms.
- Separator splitting: Text is split at hyphens, underscores, and whitespace.
For the title "NOURISH – WELLNESS PLANNER – UNDATED – SOFT COVER":
- Hyphens are treated as separators.
- The tokenized terms are
nourish
,wellness
,planner
,undated
,soft
,cover
.
Query Tokenization vs. Indexed Data
When searching for "soft-cover"
:
- The phrase is tokenized into
soft
andcover
. - These tokens exist in the indexed title.
When searching for "well-ness"
:
- The phrase is tokenized into
well
andness
. - Neither token exists in the indexed title ("wellness" is tokenized as
wellness
).
Case Insensitivity and Partial Matching
FTS5 does not natively support stemming or partial matching. The mismatch between well-ness
(tokenized as well
+ ness
) and wellness
(tokenized as wellness
) explains the failed query.
Optimizing FTS5 Queries with Non-Indexed Columns and Partitioning Strategies
Core Observations
- The
store_id
is declared asUNINDEXED
but used in theWHERE
clause. - The user seeks to "partition" the FTS5 table by
store_id
for performance.
Root Causes
Role of UNINDEXED Columns
The UNINDEXED
keyword in FTS5:
- Stores the column’s value alongside the full-text index.
- Excludes the column from text indexing (it cannot be searched via
MATCH
).
Queries with store_id=100
in the WHERE
clause:
- Perform a full-text search using
title MATCH ?
. - Filter the results using
store_id=100
.
Performance Implications
- Without an index on
store_id
, SQLite performs a linear scan of the FTS5 table’sstore_id
values. - Adding a standard SQLite index on
store_id
is not possible because FTS5 virtual tables do not support traditional indexes.
Partitioning Strategies
SQLite does not support native partitioning. However, two workarounds exist:
- Per-Store FTS5 Tables: Create separate FTS5 tables for each
store_id
(e.g.,products_store_100
,products_store_101
).- Pros: Isolates searches to a single table.
- Cons: Increases schema complexity and query fragmentation.
- External Content Tables: Use a regular table to store
store_id
and other metadata, joined with the FTS5 table viarowid
.- Example:
CREATE TABLE product_metadata ( rowid INTEGER PRIMARY KEY, store_id INTEGER ); CREATE VIRTUAL TABLE products_fts USING fts5(title);
- Query using a join:
SELECT title FROM product_metadata JOIN products_fts ON product_metadata.rowid = products_fts.rowid WHERE store_id = 100 AND title MATCH 'soft';
- Pros: Allows standard indexing on
store_id
.
- Example:
Solutions and Fixes
Query Syntax Fixes
- Hyphenated Terms:
- Always enclose hyphenated phrases in double quotes:
title MATCH '"soft-cover"'
.
- Always enclose hyphenated phrases in double quotes:
- Logical Operators:
- Use parentheses and explicit quoting:
title MATCH 'nourish OR "soft-cover"'
.
- Use parentheses and explicit quoting:
Tokenization Adjustments
- Custom Tokenizers:
- Implement a tokenizer that retains hyphens or merges specific terms.
- Example using the Porter tokenizer:
CREATE VIRTUAL TABLE products USING fts5(title, tokenize='porter unicode61');
- Preprocessing:
- Normalize hyphens in input data (e.g., replace
well-ness
withwellness
).
- Normalize hyphens in input data (e.g., replace
Store_ID Filtering Optimization
- External Table with Index:
- Store
store_id
in a regular table with an index:CREATE TABLE product_stores ( rowid INTEGER PRIMARY KEY, store_id INTEGER ); CREATE INDEX idx_store_id ON product_stores(store_id);
- Query using a join:
SELECT title FROM product_stores JOIN products_fts ON product_stores.rowid = products_fts.rowid WHERE store_id = 100 AND title MATCH 'soft';
- Store
- Materialized Views:
- Use a view to combine FTS5 and metadata:
CREATE VIEW products_view AS SELECT title, store_id FROM product_stores JOIN products_fts ON product_stores.rowid = products_fts.rowid;
- Use a view to combine FTS5 and metadata:
Partitioning Workarounds
- Dynamic Table Selection:
- Use application logic to route queries to store-specific FTS5 tables:
def query_store(store_id, search_term): table_name = f'products_store_{store_id}' query = f'SELECT * FROM {table_name} WHERE title MATCH ?' return conn.execute(query, (search_term,)).fetchall()
- Use application logic to route queries to store-specific FTS5 tables:
- Sharding via Attached Databases:
- Attach per-store databases dynamically:
ATTACH DATABASE 'store_100.db' AS store100; SELECT * FROM store100.products WHERE title MATCH 'soft';
- Attach per-store databases dynamically:
This guide systematically addresses syntax errors, tokenization nuances, and indexing strategies to optimize FTS5 usage in hybrid search systems.