Query Performance Regression in SQLite 3.44.2 Due to Suboptimal Index Selection
Issue Overview: Degraded Query Performance from Index Selection Changes in SQLite 3.44.2
A user reported a severe performance regression after upgrading from SQLite 3.42.0 to 3.44.2. A query that previously executed in seconds began taking 20 minutes. The root cause was traced to a change in the query planner’s index selection strategy. In version 3.42.0, the query used a composite index (attributes_data_parent_id_and_name_index
) to efficiently resolve a join condition involving data_parent_id
and name
. After upgrading to 3.44.2, the query planner switched to using a less optimal single-column index (attributes_name_index
), resulting in a full scan of intermediate results and drastic slowdowns.
The user identified a specific optimization removed in commit 796a65fa61373b5d
as the catalyst for this regression. Reverting this optimization locally restored the original query plan and performance. The core issue revolves around how the query planner evaluates candidate indexes for joins involving OR conditions and composite predicates. The regression highlights a scenario where newer heuristics in the query planner inadvertently prioritize an index that does not optimally reduce the search space for the given query structure.
The test case provided demonstrates the problem:
- A schema with four indexes on the
attributes
table. - A query that joins two subqueries, one filtering by
id
and the other byname
, with a WHERE clause involvingdata_parent_id
anddata_class_id
linked via OR. - In SQLite 3.42.0, the composite index
attributes_data_parent_id_and_name_index
is used for thedbAttrs3
subquery, enabling efficient lookups. - In SQLite 3.44.2, the single-column
attributes_name_index
is selected instead, forcing a scan of all rows matchingname = 'asdf'
and requiring subsequent filtering fordata_parent_id
ordata_class_id
.
This change in index selection directly impacts execution time because the composite index allows the database to resolve both name
and data_parent_id
in a single seek, while the single-column index requires separate processing of the OR conditions.
Possible Causes: Why the Query Planner Chooses a Suboptimal Index
1. Changes to the Query Planner’s Cost Model in SQLite 3.44.2
The commit 796a65fa61373b5d
removed an optimization related to index selection for queries with OR-connected predicates. This optimization likely influenced how the planner weighed the cost of using composite indexes versus single-column indexes. Without it, the planner may undervalue composite indexes when evaluating OR clauses, even if they cover multiple columns referenced in the query.
The cost model in SQLite’s query planner relies on estimates of disk I/O and CPU operations. Composite indexes can reduce the number of rows processed by resolving multiple predicates at once. However, if the planner’s heuristics assign a higher cost to using composite indexes for certain query patterns (e.g., OR conditions spanning multiple tables), it may fall back to single-column indexes, leading to suboptimal plans.
2. Outdated or Missing Statistical Metadata (ANALYZE Data)
SQLite uses stored statistics in the sqlite_stat1
table to estimate the selectivity of indexes. If these statistics are missing or outdated, the query planner cannot accurately compare the efficiency of candidate indexes. For example:
- The composite index
attributes_data_parent_id_and_name_index
might have better selectivity (fewer rows per(name, data_parent_id)
pair) than the single-columnattributes_name_index
. - If the
sqlite_stat1
table does not reflect this, the planner might assume both indexes have similar selectivity and choose the one with a lower perceived cost (e.g., smaller index size).
Running the ANALYZE
command regenerates these statistics. The absence of up-to-date statistics is a common cause of sudden performance regressions after upgrades, as newer planner versions may rely more heavily on statistical data.
3. OR Condition Handling and Join Reordering
The WHERE clause dbAttrs3.data_parent_id = dbAttrs2.data_parent_id OR dbAttrs3.data_parent_id = dbAttrs2.data_class_id
complicates query planning. SQLite must evaluate two potential join conditions and decide whether to process them as separate searches or as a combined scan.
In SQLite 3.42.0, the planner may have split the OR into separate queries and used the composite index to resolve both branches efficiently. In 3.44.2, the planner might instead scan all rows matching name = 'asdf'
and then filter them against the OR condition. This approach increases I/O and CPU load, especially if many rows match the name
predicate but few satisfy the data_parent_id
/data_class_id
conditions.
Troubleshooting Steps, Solutions & Fixes
1. Update Statistical Metadata with ANALYZE
Step 1: Run ANALYZE
in the SQLite CLI to regenerate table and index statistics:
ANALYZE;
This populates the sqlite_stat1
table with data such as the number of rows per index entry, which the query planner uses to estimate costs.
Step 2: Compare the query plans before and after running ANALYZE
using EXPLAIN QUERY PLAN
. If the planner switches back to the composite index, the problem was caused by stale statistics.
Step 3: If performance improves but remains suboptimal, inspect the sqlite_stat1
entries for the affected indexes:
SELECT tbl, idx, stat FROM sqlite_stat1 WHERE tbl = 'attributes';
Ensure the stat
column reflects the expected distribution (e.g., the composite index should show higher selectivity for (name, data_parent_id)
pairs).
Solution: Schedule regular ANALYZE
runs after significant data changes or schema modifications. Consider using PRAGMA optimize
(SQLite 3.18.0+) to automate statistics updates.
2. Force Index Usage with Index Hints
If ANALYZE
does not resolve the issue, override the planner’s index selection using SQLite’s INDEXED BY
clause:
SELECT DISTINCT dbAttrs3.id
FROM
(SELECT ... FROM attributes AS dbAttrs1 ...) AS dbAttrs2,
(SELECT * FROM attributes INDEXED BY attributes_data_parent_id_and_name_index WHERE name = 'asdf') AS dbAttrs3
WHERE ...;
Caution: Hardcoding indexes reduces the planner’s flexibility. Validate that the forced index consistently improves performance across different data distributions.
Alternative: Use NOT INDEXED
to exclude the problematic single-column index:
SELECT * FROM attributes NOT INDEXED WHERE name = 'asdf';
3. Rewrite the Query to Guide the Planner
Modify the query structure to make the composite index more appealing to the planner:
Option A: Separate the OR into UNION ALL subqueries:
SELECT DISTINCT dbAttrs3.id
FROM
dbAttrs2,
(
SELECT * FROM attributes
WHERE name = 'asdf' AND data_parent_id = dbAttrs2.data_parent_id
UNION ALL
SELECT * FROM attributes
WHERE name = 'asdf' AND data_parent_id = dbAttrs2.data_class_id
) AS dbAttrs3
...;
This forces the planner to use the composite index for both branches of the OR.
Option B: Materialize the dbAttrs2
subquery to avoid repeated scans:
WITH dbAttrs2 AS MATERIALIZED (
SELECT DISTINCT dbAttrs1.id, dbAttrs1.data_class_id, dbAttrs1.data_parent_id, dbAttrs1.name
FROM attributes AS dbAttrs1
WHERE dbAttrs1.id = 1234
)
SELECT DISTINCT dbAttrs3.id
FROM dbAttrs2, attributes AS dbAttrs3
WHERE
(dbAttrs3.data_parent_id = dbAttrs2.data_parent_id OR dbAttrs3.data_parent_id = dbAttrs2.data_class_id)
AND dbAttrs3.name = 'asdf'
ORDER BY dbAttrs3.id;
The MATERIALIZED
hint ensures the subquery is computed once, reducing redundant computations.
4. Adjust Index Definitions for Better Coverage
Ensure the composite index includes all columns referenced in the query’s WHERE and JOIN clauses. For example, if the query also filters on data_class_id
, consider a new index:
CREATE INDEX attributes_covering_index ON attributes(name, data_parent_id, data_class_id);
Covering indexes allow the planner to resolve queries without accessing the main table, reducing I/O.
5. Downgrade or Patch SQLite
If immediate resolution is critical, revert to SQLite 3.42.0 or apply a local patch to reintroduce the optimization from commit 796a65fa61373b5d
.
Long-term: Report the regression to SQLite’s maintainers with a reproducible test case (schema, data sample, and query). Collaborate to refine the query planner’s cost model for OR-heavy queries.
This guide provides a comprehensive pathway to diagnose and resolve query planner regressions in SQLite, emphasizing the interplay between index design, statistical metadata, and query structure.