Optimizing FTS5 Query Performance with Per-Customer Indexing in SQLite

Issue Overview: Single FTS5 Index Causing Cross-Customer Query Inefficiency

The core challenge arises when a shared FTS5 index is used to store documents from multiple customers, leading to degraded query performance. The current query pattern involves filtering FTS results by customer_id after retrieving a large subset of matches from the index. For example:

SELECT id, document, metadata  
FROM my_main_table  
WHERE customer_id = 'some_customer_id'  
AND id IN (  
  SELECT id  
  FROM my_main_fts  
  WHERE document MATCH 'iphone or >13 or black'  
  ORDER BY rank  
  LIMIT 500  
)  
LIMIT 3;  

This approach forces SQLite to scan the entire FTS5 index for matches across all customers, retrieve up to 500 candidate document IDs, and then filter those IDs against the customer_id in the main table. The inefficiency stems from two factors:

  1. Unnecessary Index Scans: The FTS5 index contains documents from all customers, so even queries targeting a single customer must traverse irrelevant data.
  2. Post-Filtering Overhead: The IN clause and customer_id filter operate after the FTS index scan, requiring intermediate result sets to be materialized and compared.

As the number of customers and documents grows, the FTS5 index becomes bloated, and query latency increases due to the compounding effects of these inefficiencies. The goal is to isolate FTS index segments per customer to reduce scan ranges and eliminate cross-customer data contamination during searches.

Possible Causes: Structural Limitations of FTS5 Index Design

The root causes of the performance degradation are tied to how FTS5 indexes are structured and queried:

  1. Monolithic FTS5 Index Architecture:
    By default, FTS5 creates a single inverted index for all documents in the virtual table. This design is optimal for global searches but becomes a liability when queries target subsets of documents (e.g., per-customer filtering). The index cannot natively partition data by metadata columns like customer_id, leading to full or partial scans even for filtered queries.

  2. Query Pattern Mismatch:
    The current query uses a subquery to retrieve FTS matches first, then filters those matches by customer_id in the outer query. This forces SQLite to process two separate steps:

    • Retrieve all FTS matches (regardless of customer).
    • Filter the matches against the customer_id in the main table.
      The absence of a direct correlation between the FTS index and the customer_id column prevents the query optimizer from pushing the customer filter into the FTS search phase.
  3. LIMIT Clause Misuse:
    The LIMIT 500 in the subquery is a workaround to reduce the number of candidate IDs passed to the outer customer_id filter. However, this introduces two risks:

    • Underfetching: If the first 500 matches contain fewer than 3 valid customer documents, the query returns fewer results than expected.
    • Ranking Inaccuracy: The ORDER BY rank clause may prioritize higher-ranked documents from other customers, pushing relevant customer documents beyond the 500-row cutoff.
  4. Lack of Column-Level Filtering in FTS5:
    While FTS5 supports column filters in match expressions (e.g., column:value), this feature is often underutilized. Without embedding customer_id directly in the FTS index, it’s impossible to leverage column filters to narrow the search scope during the index traversal phase.

Troubleshooting Steps, Solutions & Fixes: Partitioned Indexing Strategies

Solution 1: Integrate Customer_ID into the FTS5 Index

Step 1: Modify the FTS5 Table Schema
Redesign the FTS5 virtual table to include customer_id as an indexed column. This allows the FTS5 index to store customer_id values alongside document content, enabling column-specific filtering during the match phase.

-- Drop existing FTS5 table (ensure data backup first)  
DROP TABLE my_main_fts;  

-- Recreate FTS5 table with customer_id as a separate column  
CREATE VIRTUAL TABLE my_main_fts USING fts5(  
  document,  
  customer_id UNINDEXED,  -- Stores customer_id but does not tokenize it  
  content='my_main_table',  
  content_rowid='rowid'  
);  

Step 2: Rebuild Triggers for Data Synchronization
If the original FTS5 table was populated via triggers, update them to include customer_id during insert/update/delete operations:

-- Example trigger for inserts  
CREATE TRIGGER my_main_fts_ai AFTER INSERT ON my_main_table BEGIN  
  INSERT INTO my_main_fts (rowid, document, customer_id)  
  VALUES (new.rowid, new.document, new.customer_id);  
END;  

Step 3: Rewrite the Query to Leverage Column Filters
Modify the FTS5 match expression to include a customer_id filter, ensuring the index scan is restricted to the target customer’s documents:

SELECT id, document, metadata  
FROM my_main_table  
WHERE id IN (  
  SELECT rowid  
  FROM my_main_fts  
  WHERE my_main_fts MATCH 'customer_id:' || 'some_customer_id' || ' AND (iphone or >13 or black)'  
  ORDER BY rank  
  LIMIT 3  -- Directly limit to desired results  
);  

Key Advantages:

  • The FTS5 index scan now excludes documents from other customers, drastically reducing the search space.
  • The LIMIT 3 is applied within the subquery, minimizing the number of IDs passed to the outer query.
  • No need for multiple FTS5 tables, simplifying schema management.

Caveats:

  • The customer_id column must be stored in the FTS5 table (via UNINDEXED) to avoid tokenization.
  • Queries must dynamically construct the MATCH expression to include the customer_id: filter.

Solution 2: Per-Customer FTS5 Tables with Dynamic Schema Management

Step 1: Create Customer-Specific FTS5 Tables
For each customer, generate a dedicated FTS5 table that only indexes their documents. Use a naming convention like customer_fts_<customer_id>:

-- Example for customer '123'  
CREATE VIRTUAL TABLE customer_fts_123 USING fts5(  
  document,  
  content='my_main_table',  
  content_rowid='rowid',  
  prefix='2 3'  -- Optional: Optimize for prefix queries  
);  

Step 2: Populate Customer FTS5 Tables Conditionally
Modify application logic or triggers to insert documents into the appropriate customer FTS5 table:

-- Trigger adjusted for conditional insertion  
CREATE TRIGGER my_main_fts_ai_customer AFTER INSERT ON my_main_table  
FOR EACH ROW  
BEGIN  
  INSERT INTO customer_fts_123 (rowid, document)  
  SELECT new.rowid, new.document  
  WHERE new.customer_id = '123';  
  -- Repeat for other customers via dynamic SQL or application code  
END;  

Step 3: Query the Customer-Specific FTS5 Table Directly
Rewrite queries to target the customer’s FTS5 table, eliminating cross-customer scans entirely:

SELECT id, document, metadata  
FROM my_main_table  
WHERE id IN (  
  SELECT rowid  
  FROM customer_fts_123  
  WHERE document MATCH 'iphone or >13 or black'  
  ORDER BY rank  
  LIMIT 3  
);  

Key Advantages:

  • Complete isolation of customer documents in separate FTS5 indexes.
  • Queries are inherently scoped to a single customer, avoiding post-filtering.
  • Index maintenance operations (e.g., rebuilds) can be performed per customer.

Caveats:

  • Schema Proliferation: Managing hundreds or thousands of FTS5 tables complicates schema migrations and backups.
  • Trigger Complexity: Triggers must conditionally route inserts/updates to the correct FTS5 table, which is challenging to automate.
  • External Content Limitations: SQLite may incorrectly flag the database as corrupted if the content table (my_main_table) is modified without updating the FTS5 tables.

Solution 3: Hybrid Approach with Filtered Materialized Views

Step 1: Create Materialized Views Per Customer
Use table triggers to maintain materialized views of each customer’s documents, then build FTS5 indexes on these views:

-- Materialized view for customer '123'  
CREATE TABLE customer_123_docs (  
  rowid INTEGER PRIMARY KEY,  
  document TEXT,  
  metadata TEXT  
);  

-- Trigger to populate the materialized view  
CREATE TRIGGER my_main_customer_123_ai AFTER INSERT ON my_main_table  
FOR EACH ROW WHEN NEW.customer_id = '123'  
BEGIN  
  INSERT INTO customer_123_docs (rowid, document, metadata)  
  VALUES (NEW.rowid, NEW.document, NEW.metadata);  
END;  

-- FTS5 index on the materialized view  
CREATE VIRTUAL TABLE customer_fts_123 USING fts5(document, content='customer_123_docs');  

Step 2: Query the Materialized View’s FTS5 Index

SELECT rowid, document, metadata  
FROM customer_123_docs  
WHERE rowid IN (  
  SELECT rowid  
  FROM customer_fts_123  
  WHERE document MATCH 'iphone or >13 or black'  
  ORDER BY rank  
  LIMIT 3  
);  

Key Advantages:

  • Decouples the FTS5 index from the main table, enabling isolated per-customer indexing.
  • Simplifies backups and migrations for individual customer data.

Caveats:

  • Storage Overhead: Materialized views duplicate document and metadata content.
  • Trigger Maintenance: Each customer requires a dedicated set of triggers.

Final Recommendations

  1. Benchmark Column-Filtered vs. Per-Customer Indexes:
    Test both approaches with representative data volumes to quantify performance gains. Column-filtered indexes are easier to implement but may still incur overhead for very large customer datasets.

  2. Leverage SQLite’s Full-Text Query Syntax:
    Use advanced FTS5 features like phrase searches (NEAR) and Boolean operators to refine match accuracy, reducing the number of candidates requiring post-filtering.

  3. Monitor Index Fragmentation:
    Regularly run INSERT INTO my_main_fts(my_main_fts) VALUES('optimize') to rebuild fragmented FTS5 indexes, especially for write-heavy workloads.

  4. Consider Sharding by Customer:
    For extreme scalability, partition the entire database by customer, with each shard containing its own FTS5 index. This aligns with SQLite’s “one-file-per-database” model but complicates multi-customer queries.

By systematically evaluating these strategies, you can eliminate cross-customer FTS index scans, reduce query latency, and scale your application efficiently.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *