SQLite Query Planner Selects Suboptimal Index Despite Existing Multi-Column Index

Issue Overview: Suboptimal Join Order and Index Selection in Multi-Table Query with Filtered Scan ID

The core challenge involves a SQLite query planner selecting an inefficient execution plan for a four-table join query filtered by file.scan_id. The query joins the file, matches, unique_info, and scan tables, with a filter condition on file.scan_id. The database contains millions of rows in unique_info but only thousands in file, making the planner’s decision to start with a full scan of unique_info catastrophic for performance. The user created two indexes on file.scan_id: a multi-column index (idx_file_scan_severity_likelihood) and a single-column index (file_scan). The planner ignored the multi-column index but adopted the single-column index as soon as it was created, resolving the performance issue. The root cause was stale table statistics: the ANALYZE command had been issued with invalid syntax and therefore never actually ran.

The query’s performance degradation emerged suddenly, likely due to data growth or changes in value distribution within file.scan_id or related columns. The multi-column index’s structure (scan_id ASC, severity_level ASC, highest_likelihood DESC) should theoretically support the file.scan_id filter, but the planner’s cost-based optimizer dismissed it in favor of a covering index on unique_info (idx_unique_info_info_pattern). This decision suggests the optimizer miscalculated the relative costs of competing access paths, prioritizing index coverage over join order efficiency.

Key entities include:

  • file: Central table filtered by scan_id, with two indexes.
  • matches: Junction table linking file, unique_info, and scan.
  • unique_info: Large table storing unique metadata.
  • scan: Small reference table.
  • idx_file_scan_severity_likelihood: Multi-column index on file.scan_id, severity_level, highest_likelihood.
  • file_scan: Single-column index on file.scan_id.

The planner’s initial choice to start with unique_info forced a full index scan, multiplying the cost of joining millions of rows with smaller tables. The single-column index provided a direct path to filter file first, drastically reducing the working dataset.

Possible Causes: Outdated Statistics, Index Selectivity, and Multi-Column Index Misalignment

Outdated or Missing Table/Index Statistics

SQLite’s query planner relies on statistical metadata stored in the sqlite_stat1 table (and, in builds compiled with STAT4 support, sqlite_stat4; the older sqlite_stat3 format is obsolete) to estimate the selectivity of indexes and the cardinality of filtered rows. These statistics are generated or updated via the ANALYZE command. If this data is absent or stale, the planner falls back on default assumptions about data distribution and table sizes, leading to suboptimal index selection. In this scenario, the user attempted to run ANALYZE using incorrect syntax (pragma('ANALYZE')), which is a no-op. Without accurate statistics, the planner could not recognize that filtering file first via scan_id would reduce the result set to a few thousand rows instead of scanning millions in unique_info.
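The effect of ANALYZE can be observed directly. Below is a minimal sketch using Python’s sqlite3 module; the table and index names mirror the article, but the schema and data distribution are invented for illustration:

```python
import sqlite3

# Miniature stand-in for the article's file table (hypothetical data:
# 1000 rows spread over 10 distinct scan_id values).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE file (id INTEGER PRIMARY KEY, scan_id TEXT)")
con.execute("CREATE INDEX file_scan ON file(scan_id)")
con.executemany("INSERT INTO file (scan_id) VALUES (?)",
                [(f"scan-{i % 10}",) for i in range(1000)])

# ANALYZE writes selectivity estimates into sqlite_stat1.
con.execute("ANALYZE")
stats = con.execute(
    "SELECT tbl, idx, stat FROM sqlite_stat1 WHERE tbl = 'file'").fetchall()
# The stat column encodes "<total rows> <avg rows per scan_id value>".
print(stats)
```

Until this table is populated, the planner has no way of knowing that a scan_id filter keeps only a small fraction of file.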

Multi-Column Index Structure and Usage Rules

A multi-column index is only considered for a query if the filter conditions include the leftmost column(s) of the index. While idx_file_scan_severity_likelihood starts with scan_id, the presence of additional columns (severity_level, highest_likelihood) introduces complexities. SQLite evaluates whether the index’s structure justifies its use based on the query’s requirements. If the query does not reference subsequent columns in the index, the planner may deem the index less optimal than a smaller, single-column index. This is especially true if the multi-column index has wider rows (due to included columns) or if its statistics suggest lower selectivity for the scan_id column alone.
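The leftmost-prefix rule can be verified with EXPLAIN QUERY PLAN. A small sketch (the schema is invented to match the article’s column names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE file (
    id INTEGER PRIMARY KEY, scan_id TEXT,
    severity_level INTEGER, highest_likelihood REAL)""")
con.execute("""CREATE INDEX idx_file_scan_severity_likelihood
    ON file(scan_id ASC, severity_level ASC, highest_likelihood DESC)""")

# Equality on the leftmost column alone: the index is eligible (SEARCH).
left = con.execute("EXPLAIN QUERY PLAN SELECT * FROM file WHERE scan_id = ?",
                   ("x",)).fetchone()[3]
# Equality on a non-leftmost column alone: the index cannot be used (SCAN).
other = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM file WHERE severity_level = 3"
).fetchone()[3]
print(left)
print(other)
```

So the multi-column index is structurally eligible for a scan_id-only filter; whether the planner picks it over a narrower index is a cost decision, not an eligibility one.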

Index Coverage and Join Order Bias

The idx_unique_info_info_pattern index is a covering index for the unique_info.info column, which is referenced in the SELECT clause. A covering index allows SQLite to retrieve column values without accessing the base table, reducing I/O. However, in this case, the planner overvalued this benefit compared to the cost of scanning the entire unique_info table. The optimizer’s cost model might have incorrectly assumed that starting with unique_info and joining outward would minimize disk access, not accounting for the sheer size of unique_info relative to other tables.

Query Complexity and Join Reordering

SQLite’s join reordering algorithm uses a greedy heuristic to minimize the estimated cost of executing joins. When statistics are outdated, the algorithm cannot accurately compare the costs of different join orders. The planner might have prioritized eliminating unique_info early due to the presence of a covering index, unaware that the file.scan_id filter would drastically reduce the dataset. The lack of accurate row count estimates for file after applying the scan_id filter led to a miscalculation in the total cost of the join sequence.

Troubleshooting Steps, Solutions & Fixes: Updating Statistics, Index Optimization, and Plan Forcing

Step 1: Validate and Refresh Table/Index Statistics

Execute ANALYZE correctly to generate or update statistics. The correct syntax is:

ANALYZE;

or, for finer control:

PRAGMA analysis_limit=400; -- Adjust sample size for large tables
PRAGMA optimize;

Verify that statistics exist by querying:

SELECT * FROM sqlite_stat1 WHERE tbl IN ('file', 'matches', 'unique_info', 'scan');

If the file table’s statistics are missing, or the first integer of the stat column (the estimated row count) does not reflect recent inserts, rerun ANALYZE.

Step 2: Evaluate Multi-Column Index Usability

Confirm that the multi-column index can serve the file.scan_id filter. Use EXPLAIN QUERY PLAN with a simplified query isolating the file table:

EXPLAIN QUERY PLAN
SELECT * FROM file WHERE scan_id = 'FK73ULf97k9mYd0Ee4HH';

If the output shows SCAN instead of SEARCH, the index is not being used. Check that the filter column aligns with the index’s leftmost column. Since idx_file_scan_severity_likelihood starts with scan_id, it should be eligible. If it is still ignored, test it explicitly with INDEXED BY or fall back to the single-column index.
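This check is easy to automate. A small helper (a sketch; the helper name is ours) that flags whether a query resolves to a SEARCH:

```python
import sqlite3

def uses_search(con, sql, params=()):
    # True if the first EXPLAIN QUERY PLAN row is a SEARCH rather than a SCAN.
    detail = con.execute("EXPLAIN QUERY PLAN " + sql, params).fetchone()[3]
    return detail.startswith("SEARCH")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE file (id INTEGER PRIMARY KEY, scan_id TEXT)")
con.execute("CREATE INDEX file_scan ON file(scan_id)")

indexed = uses_search(con, "SELECT * FROM file WHERE scan_id = ?",
                      ("FK73ULf97k9mYd0Ee4HH",))
# Wrapping the column in an expression defeats the index, forcing a full scan.
unindexed = uses_search(con, "SELECT * FROM file WHERE lower(scan_id) = 'x'")
print(indexed, unindexed)  # True False
```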

Step 3: Compare Index Selectivity and Sizing

Calculate the selectivity of file.scan_id to determine if the single-column index is inherently better. Run:

SELECT 
  COUNT(DISTINCT scan_id) AS distinct_scan_ids,
  COUNT(*) AS total_rows,
  (COUNT(DISTINCT scan_id) * 1.0 / COUNT(*)) AS selectivity
FROM file;

Because both indexes lead with scan_id, they offer identical filtering power for this predicate; the difference lies in row width and statistics. If scan_id has many distinct values (high selectivity, few rows per value), either index should serve the filter well once statistics are current. If scan_id has few distinct values (low selectivity, many rows per value), the planner may judge any scan_id index to be of limited benefit, and the narrower single-column index becomes the cheaper of the two.
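The same calculation in Python, with an invented distribution of 50 scans at 100 files each:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE file (id INTEGER PRIMARY KEY, scan_id TEXT)")
# Hypothetical data: 5000 files spread evenly over 50 scan_id values.
con.executemany("INSERT INTO file (scan_id) VALUES (?)",
                [(f"scan-{i % 50}",) for i in range(5000)])

distinct, total, selectivity = con.execute("""
    SELECT COUNT(DISTINCT scan_id),
           COUNT(*),
           COUNT(DISTINCT scan_id) * 1.0 / COUNT(*)
    FROM file""").fetchone()
print(distinct, total, selectivity)  # 50 5000 0.01
```

Here each scan_id value still selects only 100 of 5000 rows (2%), so an index on scan_id pays off despite the modest distinct-value ratio.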

Step 4: Force Index Usage for Comparative Testing

Temporarily force the planner to use idx_file_scan_severity_likelihood to compare performance:

SELECT ... FROM file INDEXED BY idx_file_scan_severity_likelihood ...;

If performance improves, the issue is solely due to outdated statistics. If not, the multi-column index may be inherently unsuitable due to its width or fragmentation.
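INDEXED BY can be exercised from code as well. Note that it is a hard requirement, not a hint: if the planner cannot use the named index, statement preparation fails with a "no query solution" error. A sketch with an invented schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE file (
    id INTEGER PRIMARY KEY, scan_id TEXT,
    severity_level INTEGER, highest_likelihood REAL)""")
con.execute("""CREATE INDEX idx_file_scan_severity_likelihood
    ON file(scan_id, severity_level, highest_likelihood DESC)""")

# Pin the access path to the multi-column index for comparative testing.
detail = con.execute("""EXPLAIN QUERY PLAN
    SELECT * FROM file INDEXED BY idx_file_scan_severity_likelihood
    WHERE scan_id = ?""", ("x",)).fetchone()[3]
print(detail)
```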

Step 5: Rebuild Indexes and Compact the Database

A bloated, poorly packed index inflates its estimated cost and can tip the planner toward other access paths. Rebuild the multi-column index:

REINDEX idx_file_scan_severity_likelihood;

For a comprehensive cleanup, vacuum the database:

VACUUM;

Step 6: Simplify the Query or Break into Subqueries

Reduce join complexity by materializing filtered results first:

WITH filtered_files AS (
  SELECT id, path, scan_id FROM file WHERE scan_id = 'FK73ULf97k9mYd0Ee4HH'
)
SELECT 
  matches.id, 
  filtered_files.path AS file_path,
  unique_info.info AS match,
  matches.context,
  scan.target AS scan_target
FROM filtered_files
JOIN matches ON filtered_files.id = matches.file_id
JOIN unique_info ON matches.unique_info_id = unique_info.id
JOIN scan ON matches.scan_id = scan.id;

This encourages the planner to process file first, mimicking the desired behavior. Note that SQLite may inline an ordinary CTE back into the outer query; on SQLite 3.35 or later, writing WITH filtered_files AS MATERIALIZED (...) guarantees the subquery is computed first.
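A runnable sketch of this restructuring (schema and rows are invented; on SQLite 3.35+ the CTE can additionally be declared AS MATERIALIZED):

```python
import sqlite3

con = sqlite3.connect(":memory:")
for ddl in (
    "CREATE TABLE file (id INTEGER PRIMARY KEY, path TEXT, scan_id TEXT)",
    "CREATE TABLE matches (id INTEGER PRIMARY KEY, file_id INTEGER,"
    " unique_info_id INTEGER, scan_id INTEGER, context TEXT)",
    "CREATE TABLE unique_info (id INTEGER PRIMARY KEY, info TEXT)",
    "CREATE TABLE scan (id INTEGER PRIMARY KEY, target TEXT)",
):
    con.execute(ddl)
con.execute("INSERT INTO scan VALUES (1, 'host-a')")
con.execute("INSERT INTO file VALUES (1, '/tmp/a', 'S1'), (2, '/tmp/b', 'S2')")
con.execute("INSERT INTO unique_info VALUES (1, 'token')")
con.execute("INSERT INTO matches VALUES (1, 1, 1, 1, 'ctx')")

# Filter file in the CTE, then join the reduced set outward.
rows = con.execute("""
    WITH filtered_files AS (
      SELECT id, path, scan_id FROM file WHERE scan_id = ?
    )
    SELECT matches.id, filtered_files.path, unique_info.info,
           matches.context, scan.target
    FROM filtered_files
    JOIN matches ON filtered_files.id = matches.file_id
    JOIN unique_info ON matches.unique_info_id = unique_info.id
    JOIN scan ON matches.scan_id = scan.id""", ("S1",)).fetchall()
print(rows)  # [(1, '/tmp/a', 'token', 'ctx', 'host-a')]
```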

Step 7: Adjust Schema Design for Join Efficiency

Add covering indexes to eliminate table accesses during joins. For example, ensure matches has indexes on file_id, unique_info_id, and scan_id:

CREATE INDEX idx_matches_file ON matches(file_id);
CREATE INDEX idx_matches_unique_info ON matches(unique_info_id);
CREATE INDEX idx_matches_scan ON matches(scan_id);

Covering indexes for frequently accessed columns (e.g., unique_info(info)) further reduce I/O.
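The effect of a junction-table index on the plan can be seen before and after its creation. A sketch modeling only file and matches:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE file (id INTEGER PRIMARY KEY, scan_id TEXT)")
con.execute("CREATE TABLE matches (id INTEGER PRIMARY KEY, file_id INTEGER)")
con.execute("CREATE INDEX file_scan ON file(scan_id)")

q = ("SELECT * FROM file JOIN matches ON file.id = matches.file_id "
     "WHERE file.scan_id = ?")
before = [r[3] for r in con.execute("EXPLAIN QUERY PLAN " + q, ("x",))]

con.execute("CREATE INDEX idx_matches_file ON matches(file_id)")
after = [r[3] for r in con.execute("EXPLAIN QUERY PLAN " + q, ("x",))]

print(before)  # one side of the join is a full SCAN of matches
print(after)   # the matches side becomes a SEARCH via idx_matches_file
```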

Step 8: Monitor Long-Term Plan Stability

SQLite does not retain historical query plans, so log them yourself: run EXPLAIN QUERY PLAN against critical queries periodically (for example, at application startup or after deployments) and compare the output over time. Builds compiled with SQLITE_ENABLE_STMTVTAB additionally expose a sqlite_stmt virtual table with per-statement scan and step counters, useful for spotting statements that have regressed into full scans.

Automate statistics regeneration after significant data changes via application triggers or scheduled jobs, or run PRAGMA optimize when closing long-lived connections.

Step 9: Upgrade SQLite and Validate Optimizer Enhancements

Ensure the latest SQLite version is used: query planner improvements ship regularly, and release notes frequently mention refinements to join ordering and cost estimation. Test the query under a newer version if possible.

Step 10: Profile Query Execution with Actual Timing

Use OS-level tools (time, perf) or SQLite’s internal timer to measure real execution times:

.timer on
SELECT ...;

Compare plans with and without forced indexes to quantify the impact of index selection.
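Outside the CLI, equivalent timing can be captured in Python with time.perf_counter (invented data; absolute timings vary by machine, so compare ratios rather than raw numbers):

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE file (id INTEGER PRIMARY KEY, scan_id TEXT)")
# Hypothetical data: 20000 files over 100 scan_id values.
con.executemany("INSERT INTO file (scan_id) VALUES (?)",
                [(f"scan-{i % 100}",) for i in range(20000)])

def timed(sql, params=()):
    # Return (elapsed seconds, row count) for one execution of sql.
    start = time.perf_counter()
    rows = con.execute(sql, params).fetchall()
    return time.perf_counter() - start, len(rows)

no_index, n1 = timed("SELECT * FROM file WHERE scan_id = ?", ("scan-7",))
con.execute("CREATE INDEX file_scan ON file(scan_id)")
with_index, n2 = timed("SELECT * FROM file WHERE scan_id = ?", ("scan-7",))
print(n1, n2)  # both return 200 rows; the indexed lookup is typically faster
```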

By systematically validating statistics, refining index design, and guiding the planner through query restructuring, users can resolve suboptimal index selection and restore performance. The interplay between accurate statistics, index structure, and join ordering requires continuous monitoring, especially in evolving datasets.
