Resolving FTS5 Query Syntax, Tokenization, and Indexing Issues in Hybrid Search Systems


Unexpected Query Failures with Hyphenated Terms and OR Operators in FTS5

Core Observations

The user encountered inconsistent behavior when querying an FTS5 virtual table with hyphenated terms and logical operators. Specific issues include:

  1. soft-cover failing with a "no such column: cover" error.
  2. "soft-cover" succeeding despite the hyphen.
  3. "well-ness" failing to match the indexed term "wellness."
  4. "nourish OR soft-cover" failing even though individual terms worked.

Root Causes

Hyphens as Column Filter Operators

FTS5 interprets unquoted hyphens (-) as column exclusion operators. For example, soft-cover is parsed as:

  • Search for soft in all indexed columns except a (non-existent) column named cover.

Since the products FTS5 table has no column named cover, this triggers a "no such column" error. Quoting the term ("soft-cover") forces FTS5 to treat it as a single phrase, bypassing column filter interpretation.

Tokenization of Hyphenated Phrases

FTS5’s default tokenizer splits text on hyphens, underscores, and Unicode whitespace. The phrase "soft-cover" is tokenized into soft and cover, which match the indexed title "SOFT COVER" (due to case insensitivity). However, "well-ness" is tokenized into well and ness, which do not appear in the original title "WELLNESS PLANNER."

OR Operator Precedence and Syntax

The query nourish OR soft-cover fails because:

  1. soft-cover is parsed as soft (all columns except cover).
  2. The OR operator requires both operands to be valid.
  3. The invalid column filter cover propagates the error.

Quoting the hyphenated term ("nourish OR soft-cover") does not resolve the issue because the entire quoted string is treated as a single phrase. Instead, the correct syntax is nourish OR "soft-cover", which separates the OR operator from the quoted phrase.


Discrepancies in Tokenization Behavior for Hyphenated Search Phrases

Core Observations

  1. "soft-cover" matches "SOFT COVER" despite the absence of a hyphen in the original title.
  2. "well-ness" fails to match "WELLNESS PLANNER."

Root Causes

FTS5 Tokenization Rules

The default Unicode61 tokenizer performs the following actions:

  1. Case folding: All text is converted to lowercase.
  2. Diacritic removal: Accented characters are replaced with their base forms.
  3. Separator splitting: Text is split at hyphens, underscores, and whitespace.

For the title "NOURISH – WELLNESS PLANNER – UNDATED – SOFT COVER":

  • Hyphens are treated as separators.
  • The tokenized terms are nourish, wellness, planner, undated, soft, cover.

Query Tokenization vs. Indexed Data

When searching for "soft-cover":

  • The phrase is tokenized into soft and cover.
  • These tokens exist in the indexed title.

When searching for "well-ness":

  • The phrase is tokenized into well and ness.
  • Neither token exists in the indexed title ("wellness" is tokenized as wellness).

Case Insensitivity and Partial Matching

FTS5 does not natively support stemming or partial matching. The mismatch between well-ness (tokenized as well + ness) and wellness (tokenized as wellness) explains the failed query.


Optimizing FTS5 Queries with Non-Indexed Columns and Partitioning Strategies

Core Observations

  1. The store_id is declared as UNINDEXED but used in the WHERE clause.
  2. The user seeks to "partition" the FTS5 table by store_id for performance.

Root Causes

Role of UNINDEXED Columns

The UNINDEXED keyword in FTS5:

  • Stores the column’s value alongside the full-text index.
  • Excludes the column from text indexing (it cannot be searched via MATCH).

Queries with store_id=100 in the WHERE clause:

  1. Perform a full-text search using title MATCH ?.
  2. Filter the results using store_id=100.

Performance Implications

  • Without an index on store_id, SQLite performs a linear scan of the FTS5 table’s store_id values.
  • Adding a standard SQLite index on store_id is not possible because FTS5 virtual tables do not support traditional indexes.

Partitioning Strategies

SQLite does not support native partitioning. However, two workarounds exist:

  1. Per-Store FTS5 Tables: Create separate FTS5 tables for each store_id (e.g., products_store_100, products_store_101).
    • Pros: Isolates searches to a single table.
    • Cons: Increases schema complexity and query fragmentation.
  2. External Content Tables: Use a regular table to store store_id and other metadata, joined with the FTS5 table via rowid.
    • Example:
      CREATE TABLE product_metadata (  
          rowid INTEGER PRIMARY KEY,  
          store_id INTEGER  
      );  
      CREATE VIRTUAL TABLE products_fts USING fts5(title);  
      
    • Query using a join:
      SELECT title  
      FROM product_metadata  
      JOIN products_fts ON product_metadata.rowid = products_fts.rowid  
      WHERE store_id = 100 AND title MATCH 'soft';  
      
    • Pros: Allows standard indexing on store_id.

Solutions and Fixes

Query Syntax Fixes

  1. Hyphenated Terms:
    • Always enclose hyphenated phrases in double quotes: title MATCH '"soft-cover"'.
  2. Logical Operators:
    • Use parentheses and explicit quoting: title MATCH 'nourish OR "soft-cover"'.

Tokenization Adjustments

  1. Custom Tokenizers:
    • Implement a tokenizer that retains hyphens or merges specific terms.
    • Example using the Porter tokenizer:
      CREATE VIRTUAL TABLE products USING fts5(title, tokenize='porter unicode61');  
      
  2. Preprocessing:
    • Normalize hyphens in input data (e.g., replace well-ness with wellness).

Store_ID Filtering Optimization

  1. External Table with Index:
    • Store store_id in a regular table with an index:
      CREATE TABLE product_stores (  
          rowid INTEGER PRIMARY KEY,  
          store_id INTEGER  
      );  
      CREATE INDEX idx_store_id ON product_stores(store_id);  
      
    • Query using a join:
      SELECT title  
      FROM product_stores  
      JOIN products_fts ON product_stores.rowid = products_fts.rowid  
      WHERE store_id = 100 AND title MATCH 'soft';  
      
  2. Materialized Views:
    • Use a view to combine FTS5 and metadata:
      CREATE VIEW products_view AS  
      SELECT title, store_id  
      FROM product_stores  
      JOIN products_fts ON product_stores.rowid = products_fts.rowid;  
      

Partitioning Workarounds

  1. Dynamic Table Selection:
    • Use application logic to route queries to store-specific FTS5 tables:
      def query_store(store_id, search_term):  
          table_name = f'products_store_{store_id}'  
          query = f'SELECT * FROM {table_name} WHERE title MATCH ?'  
          return conn.execute(query, (search_term,)).fetchall()  
      
  2. Sharding via Attached Databases:
    • Attach per-store databases dynamically:
      ATTACH DATABASE 'store_100.db' AS store100;  
      SELECT * FROM store100.products WHERE title MATCH 'soft';  
      

This guide systematically addresses syntax errors, tokenization nuances, and indexing strategies to optimize FTS5 usage in hybrid search systems.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *