Optimizing FTS5 Query Performance with Time-Range Constraints in SQLite Log Data
Understanding FTS5 Performance Degradation for Common Terms in Time-Bounded Log Queries
Issue Overview
The core challenge involves efficiently querying SQLite’s FTS5 virtual table for log data stored in JSON format when users apply free-text searches combined with time-range constraints. The schema uses a content-enabled FTS5 table (fts
) linked to a base table (events
), where source_values
(extracted JSON values) are indexed for full-text search. While rare terms are queried efficiently, common terms result in slow performance due to FTS5 scanning the entire index. This occurs even when queries are logically restricted to recent time ranges (e.g., "last hour" or "last day"). The problem intensifies with high data ingestion rates (~10 million records/day), causing FTS5 queries to underperform compared to LIKE
operations on the base table when timestamp ranges are applied. The goal is to reduce the volume of FTS5 data scanned without sacrificing the flexibility of free-text search.
Structural Limitations and Data Model Mismatches
- Unindexed Timestamp in FTS5: The
timestamp
column in the FTS5 table is marked asunindexed
, meaning FTS5 cannot natively optimize queries using time-range predicates. Even when aWHERE timestamp >= ...
clause is added, FTS5 must scan all rows matching the text pattern before applying the timestamp filter. - Rowid-Timestamp Correlation Assumption: The base table (
events
) uses an implicitly auto-incrementingrowid
, which is mirrored in the FTS5 table viacontent_rowid=rowid
. If log entries are inserted in chronological order,rowid
values correlate with insertion time and thus with thetimestamp
column. However, this relationship is not explicitly enforced or leveraged in the initial schema. - Content Table Overhead: The FTS5 table’s
content=events
directive creates a dependency on the base table for column values. While this reduces storage overhead, it introduces latency when resolving non-indexed columns liketimestamp
during query execution.
Hybrid Indexing Strategies and Query Rewrites
Step 1: Validate Rowid-Timestamp Correlation
Confirm that rowid
values in the events
table correspond to chronological insertion order. Execute:
SELECT rowid, timestamp FROM events ORDER BY rowid ASC LIMIT 1000;
If timestamps are non-decreasing with rowid
, proceed. If not, revise the data ingestion process to ensure chronological insertion.
Step 2: Create a Timestamp-to-Rowid Mapping Index
Add a covering index to the events
table to accelerate the translation of timestamp ranges to rowid
ranges:
CREATE INDEX idx_events_timestamp_rowid ON events(timestamp, rowid);
This index allows efficient lookups for the minimum/maximum rowid
values associated with a given timestamp range.
Step 3: Rewrite FTS5 Queries with Rowid Constraints
Replace timestamp-based WHERE
clauses with rowid
range constraints derived from the events
table. For example, to search for entries from the last hour:
SELECT *
FROM fts('<pattern>')
WHERE rowid >= (
SELECT MIN(rowid)
FROM events
WHERE timestamp >= (STRFTIME('%s', 'now') - 3600)
);
Explanation:
- The subquery
SELECT MIN(rowid) ...
uses theidx_events_timestamp_rowid
index to find the smallestrowid
corresponding to the start of the time range. - FTS5 restricts its search to
rowid >= ...
, leveraging its innate ability to optimize rowid range scans.
Step 4: Benchmark and Optimize Subquery Performance
If the subquery to map timestamps to rowid
ranges is slow, precompute boundary rowid
values using a temporary table or materialized view. For instance, maintain a time_windows
table that stores precalculated (start_time, end_time, min_rowid, max_rowid)
tuples, updated periodically via triggers or application logic.
Step 5: Partitioned FTS5 Tables (Advanced)
For extreme scalability, partition the FTS5 table by time intervals (e.g., daily). Create separate FTS5 tables (fts_20230301
, fts_20230302
, etc.) and union results across partitions during queries. This requires application-level logic to manage table creation and query routing.
Step 6: Contentless FTS5 Optimization
If storage overhead is acceptable, reconfigure the FTS5 table as contentless (content=''
) and include the timestamp
as an indexed column:
CREATE VIRTUAL TABLE fts USING fts5(timestamp, source_values, content='');
This allows direct timestamp
filtering within FTS5, but requires maintaining synchronization between the events
and fts
tables via triggers.
Edge Cases and Long-Term Maintenance
- Out-of-Order Inserts: If log entries arrive with non-chronological timestamps, the
rowid
-timestamp correlation breaks. Mitigate this by using a composite(timestamp, rowid)
index and adjusting subqueries to account for timestamp dispersion. - Data Retention Policies: Implement deletion routines that remove old data from both
events
andfts
tables to prevent uncontrolled growth. UseDELETE FROM events WHERE timestamp < ...
followed byINSERT INTO fts(fts, ...) VALUES('delete', ...)
for FTS5 integrity. - Query Planner Hints: In rare cases, SQLite’s query planner might not optimally combine FTS5 rowid ranges with subqueries. Use
NOT INDEXED
orINDEXED BY
clauses to force the desired execution plan.
By integrating rowid-based constraints with targeted indexing, FTS5 query performance for time-bounded searches can approach that of Elasticsearch for small-to-medium datasets, making SQLite a viable option for real-world deployments with 14–30 days of log data.