SQLite Query Planner Selects Wrong Index When ORDER BY and WHERE Clauses Compete
Issue Overview: ORDER BY Clause Forces Suboptimal Index Usage Despite WHERE Filter
The core problem arises when a SQLite query combines a WHERE clause filtering on one column with an ORDER BY clause sorting on another column. The query planner may prioritize using an index aligned with the ORDER BY column over the WHERE filter’s index, leading to full or partial table scans instead of efficient index-based searches. This results in significant performance degradation—from sub-millisecond execution to multi-second delays—even when the WHERE condition is highly selective.
In the provided scenario, the treatments
table has two relevant indexes:
ix_treatments_checkinTime
on thecheckinTime
column (used in the WHERE clause)- A unique index implicitly created by the
UNIQUE
constraint ontreatmentId
(used in the ORDER BY clause)
The query planner opts to scan the treatmentId
index to satisfy the ORDER BY treatmentId ASC
clause, ignoring the checkinTime
filter’s index. This occurs despite the WHERE checkinTime >= ...
condition being designed to limit the result set before sorting. The planner incorrectly assumes that leveraging the treatmentId
index for ordering is more efficient than applying the WHERE filter first and then sorting the results.
Possible Causes: Query Planner Misestimation of Cost and Index Utility
Index Selection Heuristics for Sorting vs. Filtering
SQLite’s query planner evaluates whether using an index for sorting (avoiding a temporary B-tree) outweighs the cost of scanning rows that do not match the WHERE condition. If the planner estimates that the WHERE clause will return many rows, it may prefer an index that avoids sorting, even if this requires scanning irrelevant rows. Conversely, if the WHERE clause is highly selective, using its index and then sorting the small result set would be faster. Misestimation of selectivity—often due to outdated or incomplete statistics—leads to suboptimal index choices.Implicit Unique Index on treatmentId
TheUNIQUE
constraint ontreatmentId
creates an implicit index (sqlite_autoindex_treatments_1
), which the query planner prioritizes for sorting operations. This index provides inherent ordering, making it attractive forORDER BY
clauses. However, when combined with a WHERE filter on another column, the planner may incorrectly assume that scanning this ordered index and applying the WHERE filter incrementally is cheaper than filtering first and sorting afterward.Expression Complexity in WHERE Clause
The use ofstrftime('%s', '2023-04-06') * 1000
in the WHERE clause introduces a runtime-evaluated expression. SQLite cannot precompute this value during planning if it treats the expression as non-constant, leading to conservative selectivity estimates. The planner may default to assuming the WHERE filter matches a large portion of the table, deeming thecheckinTime
index less useful.Lack of Composite Index Covering Both Clauses
Without a composite index on(checkinTime, treatmentId)
, the planner cannot satisfy both the WHERE filter and ORDER BY clause using a single index. This forces a choice between filtering efficiency and sorting efficiency, with the planner often prioritizing the latter due to the perceived cost of temporary sorting operations.
Troubleshooting Steps, Solutions & Fixes: Forcing Optimal Index Selection and Improving Planner Accuracy
Step 1: Validate and Update Statistics with ANALYZE
Action:
Run ANALYZE
to refresh table and index statistics. SQLite uses these statistics to estimate row counts and index usefulness.
Verification:
ANALYZE;
SELECT * FROM sqlite_stat1 WHERE tbl = 'treatments';
Ensure the stat
column for ix_treatments_checkinTime
reflects the approximate number of rows matching typical checkinTime
filters. Outdated statistics may cause the planner to underestimate the selectivity of the WHERE clause.
Outcome:
If statistics were stale, re-running ANALYZE
might correct the planner’s cost estimates. In the original discussion, this did not resolve the issue, indicating deeper heuristics at play.
Step 2: Use the Unary + Operator to Disrupt Index Recognition for ORDER BY
Action:
Modify the ORDER BY clause to include a no-op unary operator, such as +treatmentId
, which prevents the planner from recognizing the column as matching the implicit unique index.
Modified Query:
SELECT treatmentId
FROM treatments
WHERE checkinTime >= strftime('%s', '2023-04-06') * 1000
ORDER BY +treatmentId ASC
LIMIT 30;
Mechanism:
The +
operator converts treatmentId
into an expression, making it ineligible for index-based sorting. This forces the planner to:
- Use the
ix_treatments_checkinTime
index for filtering. - Sort the filtered results using a temporary B-tree.
Tradeoffs:
- Pros: Eliminates full-table scans; leverages the WHERE clause index.
- Cons: Adds minor overhead for temporary sorting (acceptable for small result sets).
Programmatic Adaptation:
When generating SQL dynamically, append +
to ORDER BY columns if query performance degrades due to index conflicts. Implement conditional logic to apply this tweak only when a WHERE clause on another column exists.
Step 3: Force Index Selection with INDEXED BY (Use Sparingly)
Action:
Explicitly specify the ix_treatments_checkinTime
index using the INDEXED BY
clause.
Modified Query:
SELECT treatmentId
FROM treatments INDEXED BY ix_treatments_checkinTime
WHERE checkinTime >= strftime('%s', '2023-04-06') * 1000
ORDER BY treatmentId ASC
LIMIT 30;
Considerations:
- Risks: Hardcoding indexes makes queries brittle. If the index is renamed or dropped, the query will fail.
- Use Case: Reserve this for critical queries where planner behavior is consistently unreliable and other optimizations fail.
Step 4: Signal Low Selectivity with the unlikely() Function
Action:
Wrap the WHERE clause in unlikely()
to hint that the condition is seldom true, encouraging the planner to prioritize filtering.
Modified Query:
SELECT treatmentId
FROM treatments
WHERE unlikely(checkinTime >= strftime('%s', '2023-04-06') * 1000)
ORDER BY treatmentId ASC
LIMIT 30;
Mechanism:
unlikely()
adjusts the planner’s row-count estimate downward, making the ix_treatments_checkinTime
index more appealing. This is effective when the WHERE clause is highly selective (e.g., recent timestamps in a large table).
Verification:
Check the query plan with .eqp on
to confirm the desired index is used.
Step 5: Create a Composite Covering Index
Action:
Add a composite index on (checkinTime, treatmentId)
to allow the planner to both filter and sort using a single index.
SQL:
CREATE INDEX ix_treatments_checkinTime_treatmentId
ON treatments (checkinTime, treatmentId);
Mechanism:
The composite index allows:
- Efficient range scans on
checkinTime
. - Presorted
treatmentId
values within eachcheckinTime
group.
Tradeoffs:
- Pros: Eliminates sorting overhead and ensures optimal index usage.
- Cons: Increases storage and write latency. Overkill if other query patterns don’t benefit from this index.
Usage:
After creating the index, test the original query without modifications. The planner should recognize that the composite index satisfies both clauses.
Step 6: Rebuild the Schema to Eliminate Redundant Indexes
Action:
Remove the explicit ix_treatments_treatmentId
index, as the UNIQUE
constraint already creates an implicit index.
SQL:
DROP INDEX ix_treatments_treatmentId;
Rationale:
Redundant indexes waste storage and confuse the query planner by presenting multiple candidates for the same column. The implicit unique index already provides the necessary ordering for treatmentId
.
Verification:
Check indexes with .schema treatments
and confirm ix_treatments_treatmentId
is removed.
Step 7: Use Literal Values or Precomputed Constants in WHERE Clauses
Action:
Precompute the checkinTime
value outside the query and pass it as a literal.
Example:
# Python pseudocode
timestamp = int(datetime(2023, 4, 6).timestamp()) * 1000
query = f"""
SELECT treatmentId
FROM treatments
WHERE checkinTime >= {timestamp}
ORDER BY treatmentId ASC
LIMIT 30;
"""
Mechanism:
Literal values allow the planner to estimate selectivity more accurately during compilation. Runtime-evaluated expressions (e.g., strftime('%s', ...) * 1000
) may be treated as opaque, leading to default selectivity assumptions.
Step 8: Partition Data to Reduce Index Contention
Action:
For time-series data, partition the treatments
table by checkinTime
(e.g., monthly or yearly shards).
Example Schema:
CREATE TABLE treatments_2023_04 (
id INTEGER PRIMARY KEY,
treatmentId TEXT UNIQUE NOT NULL CHECK(Length(treatmentId) = 32),
checkinTime INTEGER,
...
);
Mechanism:
Smaller tables reduce the number of rows scanned during ORDER BY operations and improve the effectiveness of per-partition indexes.
Tradeoffs:
- Pros: Significantly improves query performance for time-bound searches.
- Cons: Adds complexity to query generation and data management.
Final Recommendations
- Default to the Unary + Fix: Implement
ORDER BY +column
in your SQL generator for queries combining WHERE and ORDER BY clauses on different columns. This is minimally invasive and effectively disambiguates index selection. - Monitor Query Plans: Use
.eqp on
insqlite3
CLI orEXPLAIN QUERY PLAN
in application logs to detect regressions. - Precompute WHERE Values: Avoid runtime expressions in WHERE clauses to aid planner accuracy.
- Audit Indexes: Regularly review indexes for redundancy and alignment with query patterns.
By systematically applying these fixes, you can coerce SQLite’s query planner into making optimal index choices while maintaining the flexibility of programmatic query generation.