SQLite Query Planner Selects Wrong Index When ORDER BY and WHERE Clauses Compete

Issue Overview: ORDER BY Clause Forces Suboptimal Index Usage Despite WHERE Filter

The core problem arises when a SQLite query combines a WHERE clause filtering on one column with an ORDER BY clause sorting on another column. The query planner may prioritize using an index aligned with the ORDER BY column over the WHERE filter’s index, leading to full or partial table scans instead of efficient index-based searches. This results in significant performance degradation—from sub-millisecond execution to multi-second delays—even when the WHERE condition is highly selective.

In the provided scenario, the treatments table has two relevant indexes:

  1. ix_treatments_checkinTime on the checkinTime column (used in the WHERE clause)
  2. A unique index implicitly created by the UNIQUE constraint on treatmentId (used in the ORDER BY clause)

The query planner opts to scan the treatmentId index to satisfy the ORDER BY treatmentId ASC clause, ignoring the checkinTime filter’s index. This occurs despite the WHERE checkinTime >= ... condition being designed to limit the result set before sorting. The planner incorrectly assumes that leveraging the treatmentId index for ordering is more efficient than applying the WHERE filter first and then sorting the results.

Possible Causes: Query Planner Misestimation of Cost and Index Utility

  1. Index Selection Heuristics for Sorting vs. Filtering
    SQLite’s query planner evaluates whether using an index for sorting (avoiding a temporary B-tree) outweighs the cost of scanning rows that do not match the WHERE condition. If the planner estimates that the WHERE clause will return many rows, it may prefer an index that avoids sorting, even if this requires scanning irrelevant rows. Conversely, if the WHERE clause is highly selective, using its index and then sorting the small result set would be faster. Misestimation of selectivity—often due to outdated or incomplete statistics—leads to suboptimal index choices.

  2. Implicit Unique Index on treatmentId
    The UNIQUE constraint on treatmentId creates an implicit index (sqlite_autoindex_treatments_1), which the query planner prioritizes for sorting operations. This index provides inherent ordering, making it attractive for ORDER BY clauses. However, when combined with a WHERE filter on another column, the planner may incorrectly assume that scanning this ordered index and applying the WHERE filter incrementally is cheaper than filtering first and sorting afterward.

  3. Expression Complexity in WHERE Clause
    The use of strftime('%s', '2023-04-06') * 1000 in the WHERE clause introduces a runtime-evaluated expression. SQLite cannot precompute this value during planning if it treats the expression as non-constant, leading to conservative selectivity estimates. The planner may default to assuming the WHERE filter matches a large portion of the table, deeming the checkinTime index less useful.

  4. Lack of Composite Index Covering Both Clauses
    Without a composite index on (checkinTime, treatmentId), the planner cannot satisfy both the WHERE filter and ORDER BY clause using a single index. This forces a choice between filtering efficiency and sorting efficiency, with the planner often prioritizing the latter due to the perceived cost of temporary sorting operations.

Troubleshooting Steps, Solutions & Fixes: Forcing Optimal Index Selection and Improving Planner Accuracy

Step 1: Validate and Update Statistics with ANALYZE

Action:
Run ANALYZE to refresh table and index statistics. SQLite uses these statistics to estimate row counts and index usefulness.

Verification:

ANALYZE;
SELECT * FROM sqlite_stat1 WHERE tbl = 'treatments';

Ensure the stat column for ix_treatments_checkinTime reflects the approximate number of rows matching typical checkinTime filters. Outdated statistics may cause the planner to underestimate the selectivity of the WHERE clause.

Outcome:
If statistics were stale, re-running ANALYZE might correct the planner’s cost estimates. In the original discussion, this did not resolve the issue, indicating deeper heuristics at play.


Step 2: Use the Unary + Operator to Disrupt Index Recognition for ORDER BY

Action:
Modify the ORDER BY clause to include a no-op unary operator, such as +treatmentId, which prevents the planner from recognizing the column as matching the implicit unique index.

Modified Query:

SELECT treatmentId 
FROM treatments 
WHERE checkinTime >= strftime('%s', '2023-04-06') * 1000 
ORDER BY +treatmentId ASC 
LIMIT 30;

Mechanism:
The + operator converts treatmentId into an expression, making it ineligible for index-based sorting. This forces the planner to:

  1. Use the ix_treatments_checkinTime index for filtering.
  2. Sort the filtered results using a temporary B-tree.

Tradeoffs:

  • Pros: Eliminates full-table scans; leverages the WHERE clause index.
  • Cons: Adds minor overhead for temporary sorting (acceptable for small result sets).

Programmatic Adaptation:
When generating SQL dynamically, append + to ORDER BY columns if query performance degrades due to index conflicts. Implement conditional logic to apply this tweak only when a WHERE clause on another column exists.


Step 3: Force Index Selection with INDEXED BY (Use Sparingly)

Action:
Explicitly specify the ix_treatments_checkinTime index using the INDEXED BY clause.

Modified Query:

SELECT treatmentId 
FROM treatments INDEXED BY ix_treatments_checkinTime 
WHERE checkinTime >= strftime('%s', '2023-04-06') * 1000 
ORDER BY treatmentId ASC 
LIMIT 30;

Considerations:

  • Risks: Hardcoding indexes makes queries brittle. If the index is renamed or dropped, the query will fail.
  • Use Case: Reserve this for critical queries where planner behavior is consistently unreliable and other optimizations fail.

Step 4: Signal Low Selectivity with the unlikely() Function

Action:
Wrap the WHERE clause in unlikely() to hint that the condition is seldom true, encouraging the planner to prioritize filtering.

Modified Query:

SELECT treatmentId 
FROM treatments 
WHERE unlikely(checkinTime >= strftime('%s', '2023-04-06') * 1000)
ORDER BY treatmentId ASC 
LIMIT 30;

Mechanism:
unlikely() adjusts the planner’s row-count estimate downward, making the ix_treatments_checkinTime index more appealing. This is effective when the WHERE clause is highly selective (e.g., recent timestamps in a large table).

Verification:
Check the query plan with .eqp on to confirm the desired index is used.


Step 5: Create a Composite Covering Index

Action:
Add a composite index on (checkinTime, treatmentId) to allow the planner to both filter and sort using a single index.

SQL:

CREATE INDEX ix_treatments_checkinTime_treatmentId 
ON treatments (checkinTime, treatmentId);

Mechanism:
The composite index allows:

  1. Efficient range scans on checkinTime.
  2. Presorted treatmentId values within each checkinTime group.

Tradeoffs:

  • Pros: Eliminates sorting overhead and ensures optimal index usage.
  • Cons: Increases storage and write latency. Overkill if other query patterns don’t benefit from this index.

Usage:
After creating the index, test the original query without modifications. The planner should recognize that the composite index satisfies both clauses.


Step 6: Rebuild the Schema to Eliminate Redundant Indexes

Action:
Remove the explicit ix_treatments_treatmentId index, as the UNIQUE constraint already creates an implicit index.

SQL:

DROP INDEX ix_treatments_treatmentId;

Rationale:
Redundant indexes waste storage and confuse the query planner by presenting multiple candidates for the same column. The implicit unique index already provides the necessary ordering for treatmentId.

Verification:
Check indexes with .schema treatments and confirm ix_treatments_treatmentId is removed.


Step 7: Use Literal Values or Precomputed Constants in WHERE Clauses

Action:
Precompute the checkinTime value outside the query and pass it as a literal.

Example:

# Python pseudocode
timestamp = int(datetime(2023, 4, 6).timestamp()) * 1000
query = f"""
SELECT treatmentId 
FROM treatments 
WHERE checkinTime >= {timestamp} 
ORDER BY treatmentId ASC 
LIMIT 30;
"""

Mechanism:
Literal values allow the planner to estimate selectivity more accurately during compilation. Runtime-evaluated expressions (e.g., strftime('%s', ...) * 1000) may be treated as opaque, leading to default selectivity assumptions.


Step 8: Partition Data to Reduce Index Contention

Action:
For time-series data, partition the treatments table by checkinTime (e.g., monthly or yearly shards).

Example Schema:

CREATE TABLE treatments_2023_04 (
  id INTEGER PRIMARY KEY,
  treatmentId TEXT UNIQUE NOT NULL CHECK(Length(treatmentId) = 32),
  checkinTime INTEGER,
  ...
);

Mechanism:
Smaller tables reduce the number of rows scanned during ORDER BY operations and improve the effectiveness of per-partition indexes.

Tradeoffs:

  • Pros: Significantly improves query performance for time-bound searches.
  • Cons: Adds complexity to query generation and data management.

Final Recommendations

  1. Default to the Unary + Fix: Implement ORDER BY +column in your SQL generator for queries combining WHERE and ORDER BY clauses on different columns. This is minimally invasive and effectively disambiguates index selection.
  2. Monitor Query Plans: Use .eqp on in sqlite3 CLI or EXPLAIN QUERY PLAN in application logs to detect regressions.
  3. Precompute WHERE Values: Avoid runtime expressions in WHERE clauses to aid planner accuracy.
  4. Audit Indexes: Regularly review indexes for redundancy and alignment with query patterns.

By systematically applying these fixes, you can coerce SQLite’s query planner into making optimal index choices while maintaining the flexibility of programmatic query generation.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *