SQLite Query Planner Suboptimal Index Selection Issue
SQLite Query Planner Prefers Superset Index Over More Efficient Option
The core issue revolves around the SQLite query planner’s tendency to favor an index that covers more columns in the WHERE clause, even when a more efficient index exists. In this scenario, the query involves joining three tables: Games
, Relations
, and Classifications
. The query planner selects Relations_Index2
over Relations_Index1
, despite the latter being statistically more efficient for the given query. This behavior is rooted in SQLite’s internal heuristics, which prioritize indexes that cover more WHERE clause conditions, even when the difference in efficiency is negligible or counterproductive.
The query in question is:
SELECT
Relations.IDSecondDatum, Games.Result
FROM
Games, Relations, Classifications
WHERE
(Games.BlackID=?1) AND
(Relations.TypeFirstDatum==2) AND
(Relations.IDFirstDatum=Games.ID) AND
(Relations.TypeSecondDatum=3) AND
(Relations.IDSecondDatum=Classifications.ID) AND
(Classifications.System=?4) AND
(Classifications.GameType=?5);
The indices on the Relations
table are:
CREATE INDEX Relations_Index1 ON Relations (IDFirstDatum, IDSecondDatum, TypeFirstDatum, TypeSecondDatum);
CREATE INDEX Relations_Index2 ON Relations (TypeFirstDatum, TypeSecondDatum, IDFirstDatum, IDSecondDatum);
CREATE INDEX Relations_Index3 ON Relations (IDSecondDatum, IDFirstDatum, TypeSecondDatum, TypeFirstDatum);
CREATE INDEX Relations_Index4 ON Relations (TypeSecondDatum, TypeFirstDatum, IDSecondDatum, IDFirstDatum);
The query planner’s choice of Relations_Index2
over Relations_Index1
is based on the assumption that an index covering more WHERE clause conditions is inherently better. However, in this case, Relations_Index1
is statistically more efficient, as it narrows down the search to a single row with fewer iterations.
SQLite Heuristics and Index Selection Logic
SQLite’s query planner employs a heuristic that favors indexes covering more WHERE clause conditions, even when the statistical data suggests that a simpler index would be more efficient. This heuristic is designed to account for potential variations in data distribution and to ensure that the planner makes a conservative choice. However, this approach can lead to suboptimal performance in cases where the statistical data is accurate and the simpler index is demonstrably more efficient.
The statistical data for the indices on the Relations
table is as follows:
Relations|Relations_Index4|80965 80965 80965 40 1
Relations|Relations_Index3|80965 40 1 1 1
Relations|Relations_Index2|80965 80965 80965 1 1
Relations|Relations_Index1|80965 1 1 1 1
The numbers represent the expected number of rows that would be selected using equality constraints on the first, second, third, and fourth terms of the index, respectively. For Relations_Index1
, the fourth number is 1, indicating that an equality constraint on all four columns would narrow the search to a single row. Similarly, for Relations_Index2
, the fourth number is also 1, suggesting that it would also narrow the search to a single row.
However, the key difference lies in the number of iterations required to reach that single row. Relations_Index1
requires only one iteration, while Relations_Index2
may require up to 80965 iterations in the worst case. This discrepancy is due to the order of the columns in the index and how SQLite traverses the index tree.
Forcing Index Usage and Optimizing Query Performance
To address the issue of suboptimal index selection, SQLite provides the INDEXED BY
clause, which allows developers to force the query planner to use a specific index. This can be particularly useful for debugging and performance tuning, although it is generally not recommended for production use due to the potential for changes in data distribution over time.
For example, to force the use of Relations_Index1
, the query can be modified as follows:
SELECT
Relations.IDSecondDatum, Games.Result
FROM
Games, Relations INDEXED BY Relations_Index1, Classifications
WHERE
(Games.BlackID=?1) AND
(Relations.TypeFirstDatum==2) AND
(Relations.IDFirstDatum=Games.ID) AND
(Relations.TypeSecondDatum=3) AND
(Relations.IDSecondDatum=Classifications.ID) AND
(Classifications.System=?4) AND
(Classifications.GameType=?5);
In this case, forcing the use of Relations_Index1
results in a significant performance improvement, with the query execution time dropping from 99.35 seconds to 0.0177 seconds. This demonstrates the importance of understanding the underlying statistical data and the potential impact of index selection on query performance.
Additionally, it is crucial to ensure that the statistical data used by the query planner is up-to-date. Running the ANALYZE
command can help refresh the statistics and improve the planner’s decision-making process. For example:
ANALYZE;
This command updates the statistical data for all indexes in the database, ensuring that the query planner has the most accurate information available when selecting an index.
In conclusion, while SQLite’s query planner is generally effective at selecting the most efficient index, there are cases where its heuristics can lead to suboptimal performance. By understanding the underlying statistical data and using tools like the INDEXED BY
clause and the ANALYZE
command, developers can fine-tune their queries to achieve optimal performance.