Optimizing SQLite Query Performance with Compound Indexes on Multiple Columns
Understanding Query Planner Behavior with Multi-Column Indexes in SQLite
Issue Overview: Inefficient Index Selection Despite Wide Compound Indexes
The problem arises when executing AND-connected queries against a table with many queryable columns. A compound index covering all queryable columns (a to z in the example) was created to accelerate these queries. However, SQLite’s query planner (QP) does not automatically select this index unless it is explicitly forced via INDEXED BY. Even after deleting all other indexes and running ANALYZE, the QP still prefers full table scans or suboptimal index choices. This occurs despite the compound index being a covering index (i.e., containing all columns required by the query) and delivering significantly faster results when manually enforced.
Key observations from the scenario include:
- The compound index (idx_treatments_allCols) reduces query execution time from 7 seconds to 0.2 seconds when forced.
- The QP ignores the compound index for queries filtering on columns that are not the first few columns in the index definition.
- ANALYZE statistics do not appear to influence the QP’s decision-making as expected.
This behavior contradicts the expectation that a covering index spanning all queryable columns would be prioritized by the QP. The root cause lies in how SQLite’s query planner evaluates index usability, selectivity, and cost.
Causes of Suboptimal Index Selection in Multi-Column Queries
1. Column Order in Compound Indexes
SQLite’s query planner relies heavily on the leftmost prefix rule when using compound indexes. An index on columns (a, b, c, d, ..., z) can only be efficiently used if the query includes constraints on a contiguous leftmost subset of these columns. For example:
- A query with WHERE a = ? AND b = ? can use the index.
- A query with WHERE c = ? AND d = ? cannot use the index unless the preceding columns (a and b) are also constrained or the index supports skip-scan optimization (which has limitations).
In the provided example, the query filtering on status and checkinTime did not use the compound index because these columns were not the leftmost in the index definition. The QP defaulted to a smaller index on status (if present) or a full table scan.
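To make this concrete, here is a minimal sketch; the simplified schema below is illustrative and is not the original table definition:
-- Simplified, hypothetical schema for illustration only
CREATE TABLE treatments (id INTEGER PRIMARY KEY, a TEXT, b TEXT, status TEXT, checkinTime INTEGER);
CREATE INDEX idx_treatments_allCols ON treatments (a, b, status, checkinTime);
-- Constrains the leftmost columns: the planner can SEARCH the index
EXPLAIN QUERY PLAN SELECT id FROM treatments WHERE a = 'x' AND b = 'y';
-- Skips the leftmost columns: the planner typically falls back to a
-- full table scan (or a skip-scan, if statistics justify it)
EXPLAIN QUERY PLAN SELECT id FROM treatments WHERE status = 'done' AND checkinTime > 0;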
2. Query Cost Estimation Errors
SQLite’s cost-based optimizer uses statistics from ANALYZE to estimate the number of rows filtered by each index. However, these estimates can be inaccurate for compound indexes with many columns:
- False Selectivity Assumptions: The QP assumes that a compound index with many columns has lower selectivity for individual columns, especially if those columns appear later in the index. For example, if status is the 10th column in a 20-column index, the QP may underestimate its filtering power.
- Covering Index Misjudgment: Even when a compound index is covering, the QP might incorrectly estimate that scanning the entire index is slower than scanning the table. This happens when the index is wide (contains many large columns) or when ANALYZE statistics indicate a high number of rows per index entry.
3. Skip-Scan Optimization Limitations
The skip-scan optimization allows SQLite to use an index even when the leftmost columns are not constrained, but it requires:
- Low cardinality (few distinct values) in the leading columns of the index.
- Up-to-date ANALYZE statistics to identify skip-scan opportunities.
In the example, if the columns preceding status and checkinTime in the index had high cardinality, skip-scan would be inefficient, and the QP would avoid it. This explains why forcing the index manually improved performance: for this particular query and data distribution, a skip-scan (or a full scan of the covering index) was still faster than the plan the QP chose on its own, but the QP’s cost model could not predict that in the general case.
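To see whether the planner even considers a skip-scan, inspect the query plan; in recent SQLite versions a skip-scan appears with ANY(...) terms on the skipped columns (the plan line below is indicative, not verbatim output):
EXPLAIN QUERY PLAN SELECT id FROM treatments WHERE status = 'done';
-- A skip-scan plan looks roughly like:
--   SEARCH treatments USING INDEX idx_treatments_allCols (ANY(a) AND ANY(b) AND status=?)
-- A plain "SCAN treatments" means the index was bypassed entirely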
Solutions for Ensuring Optimal Index Usage
Step 1: Reorganize Compound Indexes Based on Query Patterns
Action: Redesign the compound index to prioritize frequently queried columns and column groups. For example, if most queries filter on status and checkinTime, define the index as:
CREATE INDEX idx_treatments_priority ON treatments (status, checkinTime, ...);
Rationale: By placing high-selectivity and frequently used columns first, the index becomes usable for a broader range of queries without relying on skip-scan.
Implementation:
- Analyze query logs to identify common column combinations in WHERE clauses.
- Use SQLite’s EXPLAIN QUERY PLAN to verify index usage after reordering (see the sketch below).
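For example, after creating the reordered index, a quick check (reusing the hypothetical columns from the earlier sketch) might look like:
CREATE INDEX idx_treatments_priority ON treatments (status, checkinTime);
EXPLAIN QUERY PLAN
SELECT id FROM treatments
WHERE status = 'done' AND checkinTime BETWEEN 1000 AND 2000;
-- Expected: SEARCH treatments USING INDEX idx_treatments_priority
--           (status=? AND checkinTime>? AND checkinTime<?)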
Step 2: Validate and Update ANALYZE Statistics
Action: Ensure ANALYZE generates accurate statistics by running:
ANALYZE; -- Collect statistics for all tables and indexes
ANALYZE sqlite_schema; -- Reload statistics into the query planner (SQLite 3.33+)
Rationale: Outdated statistics can cause the QP to misestimate the cost of using a compound index. For example, if ANALYZE was run before populating the table, it would not reflect the true data distribution.
Verification:
SELECT * FROM sqlite_stat1 WHERE tbl = 'treatments';
Check the stat column for the row-count statistics (histograms, if enabled, live in sqlite_stat4). A value like 12345 10 indicates the index covers 12,345 rows, with an average of 10 rows per distinct value of its first column.
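For a compound index, the stat value carries one additional integer per indexed-column prefix. A sketch of what an entry for a hypothetical three-column index might look like:
-- Hypothetical sqlite_stat1 row for a three-column index:
--   tbl  = 'treatments'
--   idx  = 'idx_treatments_priority'
--   stat = '12345 100 10 1'
-- Reading: 12,345 rows total; ~100 rows per distinct status;
-- ~10 rows per distinct (status, checkinTime) pair; the full
-- three-column key is nearly unique (~1 row per distinct triple)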
Step 3: Force Index Usage Strategically
Action: Use INDEXED BY as a temporary workaround for critical queries:
SELECT ... FROM treatments INDEXED BY idx_treatments_allCols WHERE ...;
Rationale: This overrides the QP’s decision and is useful for:
- Queries where the index provides a proven performance benefit.
- Legacy systems where schema changes are not immediately feasible.
Caution: Overusing INDEXED BY can lead to maintenance headaches and suboptimal plans if the data distribution changes. Reserve this for edge cases.
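Also note that INDEXED BY is a requirement rather than a hint: if the named index is missing or cannot be used for the statement, preparation fails with an error instead of silently falling back to another plan:
-- Fails at prepare time if idx_treatments_allCols does not exist
-- or cannot be used for this statement:
SELECT id FROM treatments INDEXED BY idx_treatments_allCols
WHERE status = 'done';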
Step 4: Split Large Tables into Narrower Sub-Tables
Action: Separate frequently queried columns from rarely used ones:
CREATE TABLE treatments_core (
id INTEGER PRIMARY KEY,
status TEXT,
checkinTime INTEGER,
...
);
CREATE TABLE treatments_aux (
id INTEGER PRIMARY KEY,
large_data BLOB,
...
);
Rationale: Narrower tables reduce the cost of full scans and make covering indexes smaller and faster to scan. This is particularly effective if the original table contains BLOB or TEXT columns that inflate row sizes.
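Queries that occasionally need the large payload columns can reassemble a full row with a join on the shared primary key, for example:
SELECT c.id, c.status, c.checkinTime, a.large_data
FROM treatments_core AS c
JOIN treatments_aux AS a ON a.id = c.id
WHERE c.status = 'done';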
Step 5: Use Partial Indexes for High-Selectivity Filters
Action: Create indexes targeting specific values:
CREATE INDEX idx_treatments_status_nov ON treatments(status)
WHERE status = 'sp. nov.';
Rationale: Partial indexes are smaller and faster to scan when queries frequently filter on specific values (e.g., status = 'sp. nov.').
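One caveat: the planner uses a partial index only when it can prove that the query’s WHERE clause implies the index’s WHERE clause, so the query must repeat (or logically imply) the same predicate:
-- Can use idx_treatments_status_nov:
SELECT Count(*) FROM treatments WHERE status = 'sp. nov.';
-- Cannot use it; this predicate does not imply status = 'sp. nov.':
SELECT Count(*) FROM treatments WHERE status LIKE 'sp.%';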
Step 6: Benchmark Index Scans vs. Table Scans
Action: Compare the performance of forced index scans with default QP behavior:
-- With index
SELECT Count(*) FROM treatments INDEXED BY idx_treatments_allCols WHERE ...;
-- Without index
SELECT Count(*) FROM treatments WHERE ...;
Rationale: If the index scan is consistently faster, consider reworking the schema or using INDEXED BY. If not, the compound index may not be worth retaining.
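In the sqlite3 shell, the .timer dot-command prints per-statement run times, which makes this comparison straightforward:
.timer on
SELECT Count(*) FROM treatments INDEXED BY idx_treatments_allCols WHERE status = 'done';
SELECT Count(*) FROM treatments WHERE status = 'done';
-- Run each variant several times; the first run is often slower
-- while pages are still being read into the page cache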
Step 7: Monitor Query Plan Stability
Action: Regularly check query plans after data changes:
EXPLAIN QUERY PLAN SELECT ... FROM treatments WHERE ...;
Rationale: Data growth or shifts in value distributions can render previously optimal indexes ineffective. Proactive monitoring avoids sudden performance regressions.
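Typical signals to watch for in the output (exact wording varies across SQLite versions; the lines below are indicative):
-- Healthy plan: the index still satisfies the query
--   SEARCH treatments USING COVERING INDEX idx_treatments_priority (status=?)
-- Warning signs after data or schema drift:
--   SCAN treatments                -- full table scan, index ignored
--   USE TEMP B-TREE FOR ORDER BY   -- sort no longer served by an index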
Summary of Fixes and Best Practices
- Index Design: Prioritize leftmost columns based on query frequency and selectivity.
- Statistics Maintenance: Run ANALYZE after significant data changes.
- Selective Overrides: Use INDEXED BY sparingly for mission-critical queries.
- Schema Optimization: Split tables and use partial indexes to reduce overhead.
- Continuous Monitoring: Validate query plans and index usage periodically.
By aligning index structures with query patterns and maintaining accurate statistics, SQLite’s query planner can reliably select optimal indexes without manual intervention. For corner cases where the QP’s logic falls short, strategic use of INDEXED BY and schema refactoring ensures consistent performance.