SQLite Automatic Index Creation Despite Existing Schema Index: Causes and Fixes
Understanding Automatic Index Usage with Predefined Indexes in SQLite
Issue Overview: Automatic Indexes Overriding Schema-Defined Indexes
SQLite’s query planner generates automatic (transient) indexes to optimize query performance when it determines that no suitable schema-defined index exists. However, in the observed scenario, SQLite creates an automatic index for a query even though a schema-defined index on the same column(s) is present. This behavior is inconsistent across databases with identical schemas but different data.
The core problem manifests in query execution plans where SQLite selects an automatic index for one database but uses the schema-defined index for another. For example:
- Database A uses an automatic covering index (
AUTOMATIC COVERING INDEX (projectId=?)
) for aLEFT JOIN
onversion1.projectId
. - Database B uses the schema-defined index (
version_on_projectId
) for the same query.
The table schema includes a predefined index on projectId
:
CREATE TABLE "version" (
...,
"projectId" integer NOT NULL
);
-- Schema-defined index:
CREATE INDEX "version_on_projectId" ON "version" ("projectId");
Key observations:
- Both databases have identical table schemas and indexes.
- The query structure and joins are identical across databases.
- The discrepancy occurs in the query planner’s choice of index.
This inconsistency suggests that SQLite’s query planner is making data-driven decisions about index usage, influenced by factors beyond the mere presence of a schema-defined index.
Potential Causes of Automatic Index Creation with Existing Indexes
Three primary factors can lead to this behavior:
1. Collation Mismatch Between Index and Query
SQLite indexes are bound to the collation sequence of their columns. If a query’s WHERE
or JOIN
clause uses a collation different from the index’s collation, the index will be ignored. For example:
- An index on
projectId
(withBINARY
collation) will not be used if the query comparesprojectId
usingNOCASE
collation.
In the provided schema, projectId
is an integer
column, which defaults to BINARY
collation. The index version_on_projectId
explicitly uses BINARY
collation (per PRAGMA index_xinfo
). However, if the query implicitly or explicitly applies a different collation (e.g., via a CAST
or expression), the index will be disregarded.
2. Partial Index Conditions Not Matched by the Query
A partial index includes a subset of rows based on a WHERE
clause. If the query’s filter conditions do not match the partial index’s definition, SQLite will ignore the index. In this case, the version_on_projectId
index is not partial (partial=0
in PRAGMA index_list
), so this is not the direct cause. However, other indexes in the query (e.g., version_isInTrash_projectId
) could influence the planner’s decisions.
3. Statistics and Data Distribution Differences
SQLite’s query planner relies on database statistics (collected via ANALYZE
) to estimate the cost of different query plans. Differences in data distribution between databases can lead to divergent index choices. For example:
- Database A has a skewed distribution of
projectId
values, making the automatic index more efficient. - Database B has uniform
projectId
values, favoring the schema-defined index.
The presence of ENABLE_STAT4
in the SQLite build enables histogram-based statistics, which can further refine cost estimates. If one database lacks statistics (i.e., ANALYZE
was not run), the planner may default to heuristic-based decisions, including automatic index creation.
Troubleshooting Steps and Solutions
Step 1: Verify Index Collation and Query Compatibility
Check the collation of the schema-defined index:
PRAGMA index_xinfo(version_on_projectId);
Confirm that the
coll
column forprojectId
matches the collation used in the query. For integer columns, this should beBINARY
.Inspect the query for implicit collation changes:
Ensure no functions or expressions alter the collation ofprojectId
. For example:-- This would bypass the index: SELECT ... WHERE CAST(projectId AS TEXT) = '123';
Step 2: Validate Index Usability and Scope
Confirm the index is not partial:
PRAGMA index_list(version);
Ensure
partial=0
forversion_on_projectId
.Ensure the index covers all query predicates:
The query joins onversion1.projectId = folder.id
and countsversion1.id
. The schema-defined indexversion_on_projectId
includesprojectId
and the rowid (implicitly mapped toid
), making it a covering index. However, if the query required additional columns (e.g.,name
), the automatic index might include them, making it more efficient.
Step 3: Analyze Database Statistics and Data Distribution
Collect or update statistics:
RunANALYZE
to generate table and index statistics:ANALYZE;
This populates the
sqlite_stat1
,sqlite_stat4
, andsqlite_stat3
tables, which guide the query planner.Compare statistics between databases:
Export and compare statistics tables:-- Database A .dump sqlite_stat1 -- Database B .dump sqlite_stat1
Look for discrepancies in
projectId
histogram data or row counts.Force index usage for testing:
UseINDEXED BY
to override the planner’s choice:SELECT ... FROM "folder" LEFT JOIN "version" "version1" INDEXED BY (version_on_projectId) ...
Compare performance and plans to diagnose planner miscalculations.
Step 4: Optimize Query and Index Design
Create a composite index:
If the query frequently filters on multiple columns (e.g.,projectId
andisInTrash
), a composite index may outperform single-column indexes:CREATE INDEX version_projectId_isInTrash ON version(projectId, isInTrash);
Reimplement the query to reduce complexity:
Simplify aggregate operations (e.g.,COUNT(DISTINCT)
) that force temporary B-tree usage. For example, precompute counts in subqueries:SELECT folder.*, v1.total, v2.trash, v3.album FROM folder LEFT JOIN (SELECT projectId, COUNT(DISTINCT id) AS total FROM version GROUP BY projectId) v1 ...
Step 5: Disable Automatic Indexing (If Necessary)
Temporarily disable automatic indexes to test planner behavior:
PRAGMA automatic_index = OFF;
If the schema-defined index is now used, the automatic index was redundant. However, this is not a production fix, as automatic indexes often improve performance.
Final Recommendations
- Standardize collation for indexed columns and queries.
- Regularly run
ANALYZE
to keep statistics current. - Audit indexes for partial/filtered conditions and coverage.
- Simplify complex queries to reduce planner ambiguity.
By addressing these factors, SQLite’s query planner can consistently leverage schema-defined indexes, eliminating unnecessary automatic index creation.