Index on Expression Causes Incorrect GROUP BY Results in SQLite 3.41.x
Expression Indexes, GROUP BY Aggregation, and Query Planner Misalignment
Root Cause: Expression Index Optimization Interferes with GROUP BY Logic
The core issue arises from SQLite’s query planner misapplying expression index optimizations during the execution of queries involving GROUP BY clauses referencing derived columns. Specifically, when an index is defined on an expression (e.g., -tag), the query planner may incorrectly reuse the indexed expression’s precomputed value in contexts where the original column’s logical relationship to other tables is altered. This misalignment causes aggregation logic (e.g., MAX(m.value)) to bind to incorrect rows, producing invalid results.
The problem manifests when two conditions coincide:
- An expression is indexed. The expression does not need to be unusual: -tag is fully deterministic, and SQLite only accepts deterministic expressions in an index anyway.
- A subquery or join references the same expression in a GROUP BY clause while joining to another table that shares a logical relationship with the indexed column.
In the reported example, the user(-tag) index stores precomputed -tag values, which the query planner attempts to leverage to optimize the GROUP BY -tag subquery. However, when the user table is joined to marker in the second subquery, the planner incorrectly assumes that the indexed -tag expression can be reused without recalculating the relationship between u.tag and m.user_id. This results in the aggregation (MAX(m.value)) being applied to the wrong subset of rows, as the indexed expression’s context is severed from the join conditions.
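The reported scenario can be sketched in a few lines of Python using the standard sqlite3 module. The schema, the sample rows, and the join condition m.user_id = u.id are assumptions made for illustration; the full query is not reproduced in this write-up, so this is a reconstruction rather than the original report. The script runs the same query with and without the expression index so the two result sets can be compared.

```python
import sqlite3

# Hypothetical schema and data: only the table names, the index on -tag,
# and the column names tag, user_id and value come from the text above.
SCHEMA = """
CREATE TABLE user(id INTEGER PRIMARY KEY, tag INTEGER);
CREATE TABLE marker(user_id INTEGER, value INTEGER);
INSERT INTO user VALUES (1, 10), (2, 20), (3, 20);
INSERT INTO marker VALUES (1, 5), (1, 7), (2, 3), (3, 9);
"""

# Assumed shape of the query; the join on m.user_id = u.id is a guess.
QUERY = """
SELECT u.tag, v.max_value
FROM (SELECT tag FROM user GROUP BY -tag) u
JOIN (SELECT u.tag, MAX(m.value) AS max_value
      FROM user u JOIN marker m ON m.user_id = u.id
      GROUP BY u.tag) v
  ON v.tag = u.tag
"""


def run_query(with_index):
    """Run QUERY on a fresh in-memory database, optionally with the index."""
    con = sqlite3.connect(":memory:")
    try:
        con.executescript(SCHEMA)
        if with_index:
            con.execute('CREATE INDEX "user(-tag)" ON user(-tag)')
        # Sort in Python so the comparison does not depend on row order.
        return sorted(con.execute(QUERY).fetchall())
    finally:
        con.close()


baseline = run_query(with_index=False)
indexed = run_query(with_index=True)
print("SQLite", sqlite3.sqlite_version)
print("without index:", baseline)
print("with index:   ", indexed)
print("results differ" if indexed != baseline else "results match")
```

On an unaffected build the two lists should match. A difference on a 3.41.x build would be consistent with the behavior described here, though this toy data set is not guaranteed to trigger it.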
Key Failure Modes in Query Planner and Index Utilization
Three primary failure modes contribute to this bug:
Prematerialization of Expression Results:
The index on -tag materializes negative tag values, which the planner uses to shortcut the GROUP BY -tag operation. However, the materialized values are treated as standalone data rather than expressions derived from tag. When the user table is joined to marker, the planner fails to recognize that u.tag in the join condition (v.tag = u.tag) depends on the original tag column, not the indexed -tag expression. This disconnects the aggregation scope from the join, causing MAX(m.value) to bind to incorrect rows.
Loss of Column-Expression Association in Subquery Flattening:
SQLite’s query optimizer often flattens subqueries into the main query to eliminate temporary tables. In this case, flattening the (SELECT tag FROM user GROUP BY -tag) subquery causes the planner to deduplicate rows using the indexed -tag values. However, the deduplication process discards the original tag values required for the join with v.tag, leading to mismatched keys.
Incorrect Cost Estimation for Index Scans vs. Full Table Scans:
The presence of the expression index biases the planner toward using an index scan for the GROUP BY -tag operation, assuming it is cheaper than a full table scan. However, the index scan skips the step of resolving tag values from the base table, which are needed for subsequent joins. The planner’s cost model does not account for the hidden cost of losing access to the original tag column during later query stages.
Diagnostic Workflow, Mitigations, and Permanent Solutions
Step 1: Confirm SQLite Version and Index Dependencies
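- Check the SQLite version with SELECT sqlite_version(); versions 3.41.0–3.41.2 are affected.
- Identify expression indexes involved in queries with joins and aggregations:
  SELECT name, sql FROM sqlite_master WHERE type = 'index' AND sql LIKE '%(%';
  This lists every explicitly created index along with its definition, since any CREATE INDEX statement contains a parenthesis. Read the sql column and look for indexes containing expressions (e.g., -tag) rather than plain column names.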
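The two checks in this step can also be scripted. The sketch below is one possible approach, not the only one: it compares the linked library version against the range quoted above (taken from this write-up, not independently verified) and uses PRAGMA index_info, where a cid of -2 marks an indexed expression, to find expression indexes without parsing SQL text. The helper names is_affected and expression_indexes and the demo schema are made up for illustration.

```python
import sqlite3

# Affected range as stated in the text above (not independently verified).
FIRST_AFFECTED = (3, 41, 0)
LAST_AFFECTED = (3, 41, 2)


def is_affected(version_info=sqlite3.sqlite_version_info):
    """True if the given SQLite library version falls in the stated range."""
    return FIRST_AFFECTED <= tuple(version_info[:3]) <= LAST_AFFECTED


def expression_indexes(con):
    """Return sorted (table, index) pairs for indexes with an expression column."""
    found = []
    indexes = con.execute(
        "SELECT tbl_name, name FROM sqlite_master WHERE type = 'index'"
    ).fetchall()
    for table, name in indexes:
        cids = [row[0] for row in
                con.execute("SELECT cid FROM pragma_index_info(?)", (name,))]
        if -2 in cids:  # cid -2 marks an expression rather than a column
            found.append((table, name))
    return sorted(found)


# Demo on a hypothetical schema: one plain index, one expression index.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE user(id INTEGER PRIMARY KEY, tag INTEGER);
CREATE INDEX user_tag ON user(tag);
CREATE INDEX "user(-tag)" ON user(-tag);
""")
print("linked SQLite:", sqlite3.sqlite_version)
print("in stated affected range:", is_affected())
print("expression indexes:", expression_indexes(con))
```

Note that sqlite3.sqlite_version_info reports the SQLite library Python is linked against, which may differ from the version used by other applications on the same machine.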
Step 2: Analyze Query Plans With and Without the Expression Index
- Run EXPLAIN QUERY PLAN with the index enabled:
  EXPLAIN QUERY PLAN SELECT u.tag, v.max_value ... [rest of query];
  Note if SCAN user USING INDEX "user(-tag)" or similar appears in the output.
- Drop the index and rerun EXPLAIN QUERY PLAN:
  DROP INDEX "user(-tag)";
  EXPLAIN QUERY PLAN SELECT u.tag, v.max_value ... [rest of query];
  Observe if the plan switches to SCAN user (full table scan).
Step 3: Force Temporary Table Materialization for Subqueries
- Modify the query to prevent subquery flattening by adding redundant clauses like LIMIT -1:
  SELECT u.tag, v.max_value FROM (SELECT tag FROM user GROUP BY -tag LIMIT -1) u JOIN (...) v ON ...;
  This forces the subquery to materialize its results in a temporary table, decoupling the index scan from the join logic.
Step 4: Rewrite GROUP BY to Reference Base Columns
- Avoid grouping by expressions that have dedicated indexes. Instead, group by the base column and apply the expression in the SELECT list:
SELECT -tag AS neg_tag, ... FROM user GROUP BY tag; -- Instead of GROUP BY -tag
This prevents the planner from using the expression index for grouping.
Step 5: Update SQLite or Backport the Query Planner Fix
- Upgrade to SQLite 3.41.3 or newer, which includes the fix.
- If upgrading is impossible, backport the fix by modifying the query planner’s handling of expression indexes in GROUP BY contexts. The core fix involves adding a check to ensure that expressions used in indexes are not treated as standalone columns during subquery flattening and join optimization.
Step 6: Use Index Hints to Bypass the Faulty Optimization
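- Force the planner to ignore the expression index using NOT INDEXED. (INDEXED BY with a nonexistent index name does not work for this purpose: SQLite fails to prepare a statement that names an index that does not exist.)
  SELECT u.tag, v.max_value FROM (SELECT tag FROM user NOT INDEXED GROUP BY -tag) u JOIN (...) v ON ...;
  This is a hack, and because NOT INDEXED disables every index on that reference to user, it is most attractive when no other indexes on user matter to the query.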
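The query-level workarounds from Steps 3, 4 and 6 can be checked side by side. The sketch below is illustrative only: the schema, data, and the join condition m.user_id = u.id are assumptions, and Step 6 is represented by NOT INDEXED, which is SQLite's documented clause for disabling index use on a table reference. Each variant of the grouping subquery runs with the expression index present and is compared against a reference computed without the index.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
SCHEMA = """
CREATE TABLE user(id INTEGER PRIMARY KEY, tag INTEGER);
CREATE TABLE marker(user_id INTEGER, value INTEGER);
INSERT INTO user VALUES (1, 10), (2, 20), (3, 20);
INSERT INTO marker VALUES (1, 5), (1, 7), (2, 3), (3, 9);
"""

# Assumed outer query; {grouping} is replaced by one of the variants below.
TEMPLATE = """
SELECT u.tag, v.max_value
FROM ({grouping}) u
JOIN (SELECT u.tag, MAX(m.value) AS max_value
      FROM user u JOIN marker m ON m.user_id = u.id
      GROUP BY u.tag) v
  ON v.tag = u.tag
"""

VARIANTS = {
    "original": "SELECT tag FROM user GROUP BY -tag",
    "step 3, LIMIT -1": "SELECT tag FROM user GROUP BY -tag LIMIT -1",
    "step 4, GROUP BY tag": "SELECT tag FROM user GROUP BY tag",
    "step 6, NOT INDEXED": "SELECT tag FROM user NOT INDEXED GROUP BY -tag",
}


def run_variant(grouping, with_index=True):
    """Run the outer query with the given grouping subquery on a fresh database."""
    con = sqlite3.connect(":memory:")
    try:
        con.executescript(SCHEMA)
        if with_index:
            con.execute('CREATE INDEX "user(-tag)" ON user(-tag)')
        return sorted(con.execute(TEMPLATE.format(grouping=grouping)).fetchall())
    finally:
        con.close()


# Reference result: original query with no expression index at all.
reference = run_variant(VARIANTS["original"], with_index=False)
results = {name: run_variant(sql) for name, sql in VARIANTS.items()}
for name, rows in results.items():
    status = "matches reference" if rows == reference else "DIFFERS"
    print(f"{name}: {rows} ({status})")
```

A variant that matches the reference on an affected build is a candidate workaround for that particular query; it does not show that the rewrite is safe for every query shape.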
Final Solution: Expression Index Design and Query Pattern Audits
- Audit all expression indexes for potential context sensitivity. Avoid indexing expressions that are reused in joins or aggregations unless the indexed expression is immutable and independent of table relationships.
- Rewrite queries to isolate expression usage within atomic subqueries, preventing the planner from conflating indexed values with base column relationships.
By systematically addressing the interplay between expression indexes, GROUP BY logic, and join conditions, developers can resolve this class of bugs and prevent regressions in future SQLite deployments.