Optimizing FTS5 Query Performance with Per-Customer Indexing in SQLite
Issue Overview: Single FTS5 Index Causing Cross-Customer Query Inefficiency
The core challenge arises when a shared FTS5 index is used to store documents from multiple customers, leading to degraded query performance. The current query pattern involves filtering FTS results by customer_id
after retrieving a large subset of matches from the index. For example:
SELECT id, document, metadata
FROM my_main_table
WHERE customer_id = 'some_customer_id'
AND id IN (
SELECT id
FROM my_main_fts
WHERE document MATCH 'iphone or >13 or black'
ORDER BY rank
LIMIT 500
)
LIMIT 3;
This approach forces SQLite to scan the entire FTS5 index for matches across all customers, retrieve up to 500 candidate document IDs, and then filter those IDs against the customer_id
in the main table. The inefficiency stems from two factors:
- Unnecessary Index Scans: The FTS5 index contains documents from all customers, so even queries targeting a single customer must traverse irrelevant data.
- Post-Filtering Overhead: The
IN
clause andcustomer_id
filter operate after the FTS index scan, requiring intermediate result sets to be materialized and compared.
As the number of customers and documents grows, the FTS5 index becomes bloated, and query latency increases due to the compounding effects of these inefficiencies. The goal is to isolate FTS index segments per customer to reduce scan ranges and eliminate cross-customer data contamination during searches.
Possible Causes: Structural Limitations of FTS5 Index Design
The root causes of the performance degradation are tied to how FTS5 indexes are structured and queried:
Monolithic FTS5 Index Architecture:
By default, FTS5 creates a single inverted index for all documents in the virtual table. This design is optimal for global searches but becomes a liability when queries target subsets of documents (e.g., per-customer filtering). The index cannot natively partition data by metadata columns likecustomer_id
, leading to full or partial scans even for filtered queries.Query Pattern Mismatch:
The current query uses a subquery to retrieve FTS matches first, then filters those matches bycustomer_id
in the outer query. This forces SQLite to process two separate steps:- Retrieve all FTS matches (regardless of customer).
- Filter the matches against the
customer_id
in the main table.
The absence of a direct correlation between the FTS index and thecustomer_id
column prevents the query optimizer from pushing the customer filter into the FTS search phase.
LIMIT Clause Misuse:
TheLIMIT 500
in the subquery is a workaround to reduce the number of candidate IDs passed to the outercustomer_id
filter. However, this introduces two risks:- Underfetching: If the first 500 matches contain fewer than 3 valid customer documents, the query returns fewer results than expected.
- Ranking Inaccuracy: The
ORDER BY rank
clause may prioritize higher-ranked documents from other customers, pushing relevant customer documents beyond the 500-row cutoff.
Lack of Column-Level Filtering in FTS5:
While FTS5 supports column filters in match expressions (e.g.,column:value
), this feature is often underutilized. Without embeddingcustomer_id
directly in the FTS index, it’s impossible to leverage column filters to narrow the search scope during the index traversal phase.
Troubleshooting Steps, Solutions & Fixes: Partitioned Indexing Strategies
Solution 1: Integrate Customer_ID into the FTS5 Index
Step 1: Modify the FTS5 Table Schema
Redesign the FTS5 virtual table to include customer_id
as an indexed column. This allows the FTS5 index to store customer_id
values alongside document content, enabling column-specific filtering during the match phase.
-- Drop existing FTS5 table (ensure data backup first)
DROP TABLE my_main_fts;
-- Recreate FTS5 table with customer_id as a separate column
CREATE VIRTUAL TABLE my_main_fts USING fts5(
document,
customer_id UNINDEXED, -- Stores customer_id but does not tokenize it
content='my_main_table',
content_rowid='rowid'
);
Step 2: Rebuild Triggers for Data Synchronization
If the original FTS5 table was populated via triggers, update them to include customer_id
during insert/update/delete operations:
-- Example trigger for inserts
CREATE TRIGGER my_main_fts_ai AFTER INSERT ON my_main_table BEGIN
INSERT INTO my_main_fts (rowid, document, customer_id)
VALUES (new.rowid, new.document, new.customer_id);
END;
Step 3: Rewrite the Query to Leverage Column Filters
Modify the FTS5 match expression to include a customer_id
filter, ensuring the index scan is restricted to the target customer’s documents:
SELECT id, document, metadata
FROM my_main_table
WHERE id IN (
SELECT rowid
FROM my_main_fts
WHERE my_main_fts MATCH 'customer_id:' || 'some_customer_id' || ' AND (iphone or >13 or black)'
ORDER BY rank
LIMIT 3 -- Directly limit to desired results
);
Key Advantages:
- The FTS5 index scan now excludes documents from other customers, drastically reducing the search space.
- The
LIMIT 3
is applied within the subquery, minimizing the number of IDs passed to the outer query. - No need for multiple FTS5 tables, simplifying schema management.
Caveats:
- The
customer_id
column must be stored in the FTS5 table (viaUNINDEXED
) to avoid tokenization. - Queries must dynamically construct the
MATCH
expression to include thecustomer_id:
filter.
Solution 2: Per-Customer FTS5 Tables with Dynamic Schema Management
Step 1: Create Customer-Specific FTS5 Tables
For each customer, generate a dedicated FTS5 table that only indexes their documents. Use a naming convention like customer_fts_<customer_id>
:
-- Example for customer '123'
CREATE VIRTUAL TABLE customer_fts_123 USING fts5(
document,
content='my_main_table',
content_rowid='rowid',
prefix='2 3' -- Optional: Optimize for prefix queries
);
Step 2: Populate Customer FTS5 Tables Conditionally
Modify application logic or triggers to insert documents into the appropriate customer FTS5 table:
-- Trigger adjusted for conditional insertion
CREATE TRIGGER my_main_fts_ai_customer AFTER INSERT ON my_main_table
FOR EACH ROW
BEGIN
INSERT INTO customer_fts_123 (rowid, document)
SELECT new.rowid, new.document
WHERE new.customer_id = '123';
-- Repeat for other customers via dynamic SQL or application code
END;
Step 3: Query the Customer-Specific FTS5 Table Directly
Rewrite queries to target the customer’s FTS5 table, eliminating cross-customer scans entirely:
SELECT id, document, metadata
FROM my_main_table
WHERE id IN (
SELECT rowid
FROM customer_fts_123
WHERE document MATCH 'iphone or >13 or black'
ORDER BY rank
LIMIT 3
);
Key Advantages:
- Complete isolation of customer documents in separate FTS5 indexes.
- Queries are inherently scoped to a single customer, avoiding post-filtering.
- Index maintenance operations (e.g., rebuilds) can be performed per customer.
Caveats:
- Schema Proliferation: Managing hundreds or thousands of FTS5 tables complicates schema migrations and backups.
- Trigger Complexity: Triggers must conditionally route inserts/updates to the correct FTS5 table, which is challenging to automate.
- External Content Limitations: SQLite may incorrectly flag the database as corrupted if the
content
table (my_main_table
) is modified without updating the FTS5 tables.
Solution 3: Hybrid Approach with Filtered Materialized Views
Step 1: Create Materialized Views Per Customer
Use table triggers to maintain materialized views of each customer’s documents, then build FTS5 indexes on these views:
-- Materialized view for customer '123'
CREATE TABLE customer_123_docs (
rowid INTEGER PRIMARY KEY,
document TEXT,
metadata TEXT
);
-- Trigger to populate the materialized view
CREATE TRIGGER my_main_customer_123_ai AFTER INSERT ON my_main_table
FOR EACH ROW WHEN NEW.customer_id = '123'
BEGIN
INSERT INTO customer_123_docs (rowid, document, metadata)
VALUES (NEW.rowid, NEW.document, NEW.metadata);
END;
-- FTS5 index on the materialized view
CREATE VIRTUAL TABLE customer_fts_123 USING fts5(document, content='customer_123_docs');
Step 2: Query the Materialized View’s FTS5 Index
SELECT rowid, document, metadata
FROM customer_123_docs
WHERE rowid IN (
SELECT rowid
FROM customer_fts_123
WHERE document MATCH 'iphone or >13 or black'
ORDER BY rank
LIMIT 3
);
Key Advantages:
- Decouples the FTS5 index from the main table, enabling isolated per-customer indexing.
- Simplifies backups and migrations for individual customer data.
Caveats:
- Storage Overhead: Materialized views duplicate document and metadata content.
- Trigger Maintenance: Each customer requires a dedicated set of triggers.
Final Recommendations
Benchmark Column-Filtered vs. Per-Customer Indexes:
Test both approaches with representative data volumes to quantify performance gains. Column-filtered indexes are easier to implement but may still incur overhead for very large customer datasets.Leverage SQLite’s Full-Text Query Syntax:
Use advanced FTS5 features like phrase searches (NEAR
) and Boolean operators to refine match accuracy, reducing the number of candidates requiring post-filtering.Monitor Index Fragmentation:
Regularly runINSERT INTO my_main_fts(my_main_fts) VALUES('optimize')
to rebuild fragmented FTS5 indexes, especially for write-heavy workloads.Consider Sharding by Customer:
For extreme scalability, partition the entire database by customer, with each shard containing its own FTS5 index. This aligns with SQLite’s “one-file-per-database” model but complicates multi-customer queries.
By systematically evaluating these strategies, you can eliminate cross-customer FTS index scans, reduce query latency, and scale your application efficiently.