Optimizing FTS5 Query Performance with Time-Range Constraints in SQLite Log Data

Understanding FTS5 Performance Degradation for Common Terms in Time-Bounded Log Queries

Issue Overview

The core challenge involves efficiently querying SQLite’s FTS5 virtual table for log data stored in JSON format when users apply free-text searches combined with time-range constraints. The schema uses a content-enabled FTS5 table (fts) linked to a base table (events), where source_values (extracted JSON values) are indexed for full-text search. While rare terms are queried efficiently, common terms result in slow performance due to FTS5 scanning the entire index. This occurs even when queries are logically restricted to recent time ranges (e.g., "last hour" or "last day"). The problem intensifies with high data ingestion rates (~10 million records/day), causing FTS5 queries to underperform compared to LIKE operations on the base table when timestamp ranges are applied. The goal is to reduce the volume of FTS5 data scanned without sacrificing the flexibility of free-text search.

Structural Limitations and Data Model Mismatches

  1. Unindexed Timestamp in FTS5: The timestamp column in the FTS5 table is marked as unindexed, meaning FTS5 cannot natively optimize queries using time-range predicates. Even when a WHERE timestamp >= ... clause is added, FTS5 must scan all rows matching the text pattern before applying the timestamp filter.
  2. Rowid-Timestamp Correlation Assumption: The base table (events) uses an implicitly auto-incrementing rowid, which is mirrored in the FTS5 table via content_rowid=rowid. If log entries are inserted in chronological order, rowid values correlate with insertion time and thus with the timestamp column. However, this relationship is not explicitly enforced or leveraged in the initial schema.
  3. Content Table Overhead: The FTS5 table’s content=events directive creates a dependency on the base table for column values. While this reduces storage overhead, it introduces latency when resolving non-indexed columns like timestamp during query execution.

Hybrid Indexing Strategies and Query Rewrites

Step 1: Validate Rowid-Timestamp Correlation

Confirm that rowid values in the events table correspond to chronological insertion order. Execute:

SELECT rowid, timestamp FROM events ORDER BY rowid ASC LIMIT 1000;

If timestamps are non-decreasing with rowid, proceed. If not, revise the data ingestion process to ensure chronological insertion.

Step 2: Create a Timestamp-to-Rowid Mapping Index

Add a covering index to the events table to accelerate the translation of timestamp ranges to rowid ranges:

CREATE INDEX idx_events_timestamp_rowid ON events(timestamp, rowid);

This index allows efficient lookups for the minimum/maximum rowid values associated with a given timestamp range.

Step 3: Rewrite FTS5 Queries with Rowid Constraints

Replace timestamp-based WHERE clauses with rowid range constraints derived from the events table. For example, to search for entries from the last hour:

SELECT *
FROM fts('<pattern>')
WHERE rowid >= (
  SELECT MIN(rowid) 
  FROM events 
  WHERE timestamp >= (STRFTIME('%s', 'now') - 3600)
);

Explanation:

  • The subquery SELECT MIN(rowid) ... uses the idx_events_timestamp_rowid index to find the smallest rowid corresponding to the start of the time range.
  • FTS5 restricts its search to rowid >= ..., leveraging its innate ability to optimize rowid range scans.

Step 4: Benchmark and Optimize Subquery Performance

If the subquery to map timestamps to rowid ranges is slow, precompute boundary rowid values using a temporary table or materialized view. For instance, maintain a time_windows table that stores precalculated (start_time, end_time, min_rowid, max_rowid) tuples, updated periodically via triggers or application logic.

Step 5: Partitioned FTS5 Tables (Advanced)

For extreme scalability, partition the FTS5 table by time intervals (e.g., daily). Create separate FTS5 tables (fts_20230301, fts_20230302, etc.) and union results across partitions during queries. This requires application-level logic to manage table creation and query routing.

Step 6: Contentless FTS5 Optimization

If storage overhead is acceptable, reconfigure the FTS5 table as contentless (content='') and include the timestamp as an indexed column:

CREATE VIRTUAL TABLE fts USING fts5(timestamp, source_values, content='');

This allows direct timestamp filtering within FTS5, but requires maintaining synchronization between the events and fts tables via triggers.

Edge Cases and Long-Term Maintenance

  • Out-of-Order Inserts: If log entries arrive with non-chronological timestamps, the rowid-timestamp correlation breaks. Mitigate this by using a composite (timestamp, rowid) index and adjusting subqueries to account for timestamp dispersion.
  • Data Retention Policies: Implement deletion routines that remove old data from both events and fts tables to prevent uncontrolled growth. Use DELETE FROM events WHERE timestamp < ... followed by INSERT INTO fts(fts, ...) VALUES('delete', ...) for FTS5 integrity.
  • Query Planner Hints: In rare cases, SQLite’s query planner might not optimally combine FTS5 rowid ranges with subqueries. Use NOT INDEXED or INDEXED BY clauses to force the desired execution plan.

By integrating rowid-based constraints with targeted indexing, FTS5 query performance for time-bounded searches can approach that of Elasticsearch for small-to-medium datasets, making SQLite a viable option for real-world deployments with 14–30 days of log data.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *