SQLite Automatic Index Creation Despite Existing Schema Index: Causes and Fixes

Understanding Automatic Index Usage with Predefined Indexes in SQLite

Issue Overview: Automatic Indexes Overriding Schema-Defined Indexes

SQLite’s query planner generates automatic (transient) indexes to optimize query performance when it determines that no suitable schema-defined index exists. However, in the observed scenario, SQLite creates an automatic index for a query even though a schema-defined index on the same column(s) is present. This behavior is inconsistent across databases with identical schemas but different data.

The core problem manifests in query execution plans where SQLite selects an automatic index for one database but uses the schema-defined index for another. For example:

Database A uses an automatic covering index (AUTOMATIC COVERING INDEX (projectId=?)) for a LEFT JOIN on version1.projectId.
Database B uses the schema-defined index (version_on_projectId) for the same query.

The table schema includes a predefined index on projectId:

CREATE TABLE "version" (
  ...,
  "projectId" integer NOT NULL
);
-- Schema-defined index:
CREATE INDEX "version_on_projectId" ON "version" ("projectId");

Key observations:

Both databases have identical table schemas and indexes.
The query structure and joins are identical across databases.
The discrepancy occurs in the query planner’s choice of index.

This inconsistency suggests that SQLite’s query planner is making data-driven decisions about index usage, influenced by factors beyond the mere presence of a schema-defined index.

Potential Causes of Automatic Index Creation with Existing Indexes

Three primary factors can lead to this behavior:

1. Collation Mismatch Between Index and Query

SQLite indexes are bound to the collation sequence of their columns. If a query’s WHERE or JOIN clause uses a collation different from the index’s collation, the index will be ignored. For example:

An index on projectId (with BINARY collation) will not be used if the query compares projectId using NOCASE collation.

In the provided schema, projectId is an integer column, which defaults to BINARY collation. The index version_on_projectId explicitly uses BINARY collation (per PRAGMA index_xinfo). However, if the query implicitly or explicitly applies a different collation (e.g., via a CAST or expression), the index will be disregarded.

2. Partial Index Conditions Not Matched by the Query

A partial index includes a subset of rows based on a WHERE clause. If the query’s filter conditions do not match the partial index’s definition, SQLite will ignore the index. In this case, the version_on_projectId index is not partial (partial=0 in PRAGMA index_list), so this is not the direct cause. However, other indexes in the query (e.g., version_isInTrash_projectId) could influence the planner’s decisions.

3. Statistics and Data Distribution Differences

SQLite’s query planner relies on database statistics (collected via ANALYZE) to estimate the cost of different query plans. Differences in data distribution between databases can lead to divergent index choices. For example:

Database A has a skewed distribution of projectId values, making the automatic index more efficient.
Database B has uniform projectId values, favoring the schema-defined index.

The presence of ENABLE_STAT4 in the SQLite build enables histogram-based statistics, which can further refine cost estimates. If one database lacks statistics (i.e., ANALYZE was not run), the planner may default to heuristic-based decisions, including automatic index creation.

Troubleshooting Steps and Solutions

Step 1: Verify Index Collation and Query Compatibility

Check the collation of the schema-defined index:
```
PRAGMA index_xinfo(version_on_projectId);
```
Confirm that the coll column for projectId matches the collation used in the query. For integer columns, this should be BINARY.
Inspect the query for implicit collation changes:
Ensure no functions or expressions alter the collation of projectId. For example:
```
-- This would bypass the index:
SELECT ... WHERE CAST(projectId AS TEXT) = '123';
```

Step 2: Validate Index Usability and Scope

Confirm the index is not partial:
```
PRAGMA index_list(version);
```
Ensure partial=0 for version_on_projectId.
Ensure the index covers all query predicates:
The query joins on version1.projectId = folder.id and counts version1.id. The schema-defined index version_on_projectId includes projectId and the rowid (implicitly mapped to id), making it a covering index. However, if the query required additional columns (e.g., name), the automatic index might include them, making it more efficient.

Step 3: Analyze Database Statistics and Data Distribution

Collect or update statistics:
Run ANALYZE to generate table and index statistics:
```
ANALYZE;
```
This populates the sqlite_stat1, sqlite_stat4, and sqlite_stat3 tables, which guide the query planner.
Compare statistics between databases:
Export and compare statistics tables:
```
-- Database A
.dump sqlite_stat1
-- Database B
.dump sqlite_stat1
```
Look for discrepancies in projectId histogram data or row counts.
Force index usage for testing:
Use INDEXED BY to override the planner’s choice:
```
SELECT ...
FROM "folder"
LEFT JOIN "version" "version1" INDEXED BY (version_on_projectId) ...
```
Compare performance and plans to diagnose planner miscalculations.

Step 4: Optimize Query and Index Design

Create a composite index:
If the query frequently filters on multiple columns (e.g., projectId and isInTrash), a composite index may outperform single-column indexes:
```
CREATE INDEX version_projectId_isInTrash ON version(projectId, isInTrash);
```
Reimplement the query to reduce complexity:
Simplify aggregate operations (e.g., COUNT(DISTINCT)) that force temporary B-tree usage. For example, precompute counts in subqueries:
```
SELECT folder.*, v1.total, v2.trash, v3.album
FROM folder
LEFT JOIN (SELECT projectId, COUNT(DISTINCT id) AS total FROM version GROUP BY projectId) v1 ...
```

Step 5: Disable Automatic Indexing (If Necessary)

Temporarily disable automatic indexes to test planner behavior:

PRAGMA automatic_index = OFF;

If the schema-defined index is now used, the automatic index was redundant. However, this is not a production fix, as automatic indexes often improve performance.

Final Recommendations

Standardize collation for indexed columns and queries.
Regularly run ANALYZE to keep statistics current.
Audit indexes for partial/filtered conditions and coverage.
Simplify complex queries to reduce planner ambiguity.

By addressing these factors, SQLite’s query planner can consistently leverage schema-defined indexes, eliminating unnecessary automatic index creation.

SQLite Automatic Index Creation Despite Existing Schema Index: Causes and Fixes