Index on Expression Causes Incorrect GROUP BY Results in SQLite 3.41.x

Expression Indexes, GROUP BY Aggregation, and Query Planner Misalignment

Root Cause: Expression Index Optimization Interferes with GROUP BY Logic

The core issue arises from SQLite’s query planner misapplying expression index optimizations while executing queries whose GROUP BY clause references an indexed expression. When an index is defined on an expression (e.g., -tag), the planner may reuse the value stored in the index in a context where the original column value is required. As a result, aggregates such as MAX(m.value) can be computed over the wrong rows, and the query returns incorrect results.

The problem manifests when two conditions coincide:

  1. An index is defined on an expression of a column (here, -tag). The expression is deterministic, as SQLite requires of every indexed expression, so the fault lies in how the planner uses the index rather than in the expression itself.
  2. A subquery groups by that same expression while the query joins to another table through the underlying column.

In the reported example, the user(-tag) index stores precomputed -tag values, which the query planner attempts to leverage to optimize the GROUP BY -tag subquery. However, when the user table is joined to marker in the second subquery, the planner incorrectly assumes that the indexed -tag expression can be reused without recalculating the relationship between u.tag and m.user_id. This results in the aggregation (MAX(m.value)) being applied to the wrong subset of rows, as the indexed expression’s context is severed from the join conditions.
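The comparison described above can be sketched as a small harness that runs one query with and without the expression index. The schema, data, and query below are hypothetical stand-ins with the same general shape as the reported case (the original reproduction script is not shown in this guide), so the script may or may not trigger the fault on an affected build. It only reports whether the two result sets agree.

```python
import sqlite3

# Hypothetical schema, data, and query: stand-ins of the same general shape
# as the reported case, not the original reproduction script.
SCHEMA = """
CREATE TABLE user(id INTEGER PRIMARY KEY, tag INTEGER);
CREATE TABLE marker(user_id INTEGER, value INTEGER);
INSERT INTO user(id, tag) VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO marker(user_id, value) VALUES (1, 5), (1, 7), (2, 3), (3, 9), (3, 1);
"""

QUERY = """
SELECT u.tag, v.max_value
FROM (SELECT tag FROM user GROUP BY -tag) u
JOIN (SELECT u2.tag AS tag, MAX(m.value) AS max_value
      FROM user u2 JOIN marker m ON m.user_id = u2.id
      GROUP BY u2.tag) v ON v.tag = u.tag
ORDER BY u.tag
"""


def run_query(with_index):
    """Build a fresh in-memory database and return the query's rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        if with_index:
            # The index name mirrors the one used in this guide.
            conn.execute('CREATE INDEX "user(-tag)" ON user(-tag)')
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


baseline = run_query(with_index=False)
indexed = run_query(with_index=True)
print("SQLite", sqlite3.sqlite_version)
print("without index:", baseline)
print("with index:   ", indexed)
print("results agree:", baseline == indexed)
```

If the last line prints False, the index changes the answer and the build is a candidate for the mitigations that follow.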

Key Failure Modes in Query Planner and Index Utilization

Three primary failure modes contribute to this bug:

  1. Prematerialization of Expression Results:
    The index on -tag materializes negative tag values, which the planner uses to shortcut the GROUP BY -tag operation. However, the materialized values are treated as standalone data rather than expressions derived from tag. When the user table is joined to marker, the planner fails to recognize that u.tag in the join condition (v.tag = u.tag) depends on the original tag column, not the indexed -tag expression. This disconnects the aggregation scope from the join, causing MAX(m.value) to bind to incorrect rows.

  2. Loss of Column-Expression Association in Subquery Flattening:
    SQLite’s query optimizer often flattens subqueries into the main query to eliminate temporary tables. In this case, flattening the (SELECT tag FROM user GROUP BY -tag) subquery causes the planner to deduplicate rows using the indexed -tag values. However, the deduplication process discards the original tag values required for the join with v.tag, leading to mismatched keys.

  3. Incorrect Cost Estimation for Index Scans vs. Full Table Scans:
    The presence of the expression index biases the planner toward using an index scan for the GROUP BY -tag operation, assuming it is cheaper than a full table scan. However, the index scan skips the step of resolving tag values from the base table, which are needed for subsequent joins. The planner’s cost model does not account for the hidden cost of losing access to the original tag column during later query stages.

Diagnostic Workflow, Mitigations, and Permanent Solutions

Step 1: Confirm SQLite Version and Index Dependencies

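  • Check the SQLite version using SELECT sqlite_version();. Versions 3.41.0–3.41.2 are affected.
  • List the explicitly created indexes together with their definitions:
    SELECT name, tbl_name, sql FROM sqlite_master WHERE type = 'index' AND sql IS NOT NULL;

    Inspect the sql column for indexed terms that are expressions (e.g., -tag) rather than bare column names. Indexes that SQLite creates automatically for PRIMARY KEY and UNIQUE constraints have a NULL sql value and never contain expressions.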
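Both checks in this step can be automated. The sketch below is one possible approach, not an official tool: it treats the affected range stated in this guide as an assumption, and it detects expression indexes through PRAGMA index_xinfo, where a cid of -2 marks an expression term. The helper names in_affected_range and expression_indexes are hypothetical, as is the demo schema.

```python
import sqlite3

# Affected range as stated in this guide; treat it as an assumption and
# confirm it against the SQLite release log.
AFFECTED_MIN = (3, 41, 0)
AFFECTED_MAX = (3, 41, 2)


def in_affected_range(version_info):
    """Return True if a (major, minor, patch) tuple is in the stated range."""
    return AFFECTED_MIN <= tuple(version_info) <= AFFECTED_MAX


def expression_indexes(conn):
    """Return (index name, table name) pairs for indexes with an expression term."""
    found = []
    rows = conn.execute(
        "SELECT name, tbl_name FROM sqlite_master "
        "WHERE type = 'index' ORDER BY name"
    ).fetchall()
    for name, table in rows:
        # In PRAGMA index_xinfo output, cid = -2 marks an expression term.
        cids = [row[0] for row in conn.execute(
            "SELECT cid FROM pragma_index_xinfo(?)", (name,))]
        if -2 in cids:
            found.append((name, table))
    return found


# Hypothetical demo schema: one plain index and one expression index.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user(id INTEGER PRIMARY KEY, tag INTEGER);
CREATE INDEX user_tag ON user(tag);
CREATE INDEX "user(-tag)" ON user(-tag);
""")

print("linked SQLite:", sqlite3.sqlite_version)
print("in stated affected range:", in_affected_range(sqlite3.sqlite_version_info))
print("expression indexes:", expression_indexes(conn))
```

Note that sqlite3.sqlite_version reports the library linked into the Python process, which can differ from the version used by other applications that open the same database file.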

Step 2: Analyze Query Plans With and Without the Expression Index

  • Run EXPLAIN QUERY PLAN with the index in place:
    EXPLAIN QUERY PLAN
    SELECT u.tag, v.max_value ... [rest of query];

    Note whether a line such as SCAN user USING INDEX "user(-tag)" appears in the output.

  • Drop the index (on a copy of the database, if the index is needed in production) and rerun EXPLAIN QUERY PLAN:
    DROP INDEX "user(-tag)";
    EXPLAIN QUERY PLAN
    SELECT u.tag, v.max_value ... [rest of query];

    Check whether the plan switches to a plain SCAN user (full table scan). If the query results also change once the index is gone, the expression index is implicated.

Step 3: Force Temporary Table Materialization for Subqueries

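  • Modify the query to prevent subquery flattening by adding a no-op LIMIT -1 to the subquery. A negative LIMIT places no upper bound on the rows returned, and SQLite does not flatten a subquery that has a LIMIT clause into a join:
    SELECT u.tag, v.max_value
    FROM (SELECT tag FROM user GROUP BY -tag LIMIT -1) u
    JOIN (...) v ON ...;

    The subquery is then evaluated on its own, typically by materializing its results in a temporary table, which decouples the index scan from the join logic.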
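A quick way to gain confidence in this rewrite is to check that LIMIT -1 leaves the result rows unchanged and to look at how the plan differs. The sketch below uses a hypothetical schema, data set, and join of the same general shape as the query in this guide; run is a hypothetical helper that returns both the rows and the plan.

```python
import sqlite3

# Hypothetical schema, data, and join: illustrative stand-ins only.
SETUP = """
CREATE TABLE user(id INTEGER PRIMARY KEY, tag INTEGER);
CREATE TABLE marker(user_id INTEGER, value INTEGER);
INSERT INTO user(id, tag) VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO marker(user_id, value) VALUES (1, 5), (1, 7), (2, 3), (3, 9), (3, 1);
"""

TEMPLATE = """
SELECT u.tag, v.max_value
FROM ({subquery}) u
JOIN (SELECT u2.tag AS tag, MAX(m.value) AS max_value
      FROM user u2 JOIN marker m ON m.user_id = u2.id
      GROUP BY u2.tag) v ON v.tag = u.tag
ORDER BY u.tag
"""

PLAIN = "SELECT tag FROM user GROUP BY -tag"
GUARDED = "SELECT tag FROM user GROUP BY -tag LIMIT -1"


def run(subquery, with_index):
    """Return (rows, plan lines) for the join built around the given subquery."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SETUP)
        if with_index:
            conn.execute('CREATE INDEX "user(-tag)" ON user(-tag)')
        sql = TEMPLATE.format(subquery=subquery)
        plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
        return conn.execute(sql).fetchall(), plan
    finally:
        conn.close()


plain_rows, plain_plan = run(PLAIN, with_index=True)
guarded_rows, guarded_plan = run(GUARDED, with_index=True)
print("plain rows:  ", plain_rows)
print("guarded rows:", guarded_rows)
print("plain plan:  ", plain_plan)
print("guarded plan:", guarded_plan)
```

On a build without the fault, both variants should return the same rows, and any difference should show up only in the plan output.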

Step 4: Rewrite GROUP BY to Reference Base Columns

  • Avoid grouping by expressions that have dedicated indexes. Instead, group by the base column and apply the expression in the SELECT list:
    SELECT -tag AS neg_tag, ... 
    FROM user 
    GROUP BY tag; -- Instead of GROUP BY -tag

    This prevents the planner from using the expression index for grouping. The rewrite produces the same groups only when the expression is one-to-one over the stored values, as negation is for numeric tag values.
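Whether grouping by the base column is a safe substitute depends on the expression. The sketch below, which uses a hypothetical table with no expression index, shows that negation yields identical groups either way, while a many-to-one expression such as abs(tag) does not.

```python
import sqlite3

# Hypothetical data: tag 10 appears twice, and both 20 and -20 are present.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user(id INTEGER PRIMARY KEY, tag INTEGER);
INSERT INTO user(tag) VALUES (10), (10), (20), (-20), (30);
""")


def rows(sql):
    """Run a query on the demo connection and return all rows."""
    return conn.execute(sql).fetchall()


# Negation is one-to-one on integers, so both groupings give the same groups.
by_expr = rows("SELECT -tag, COUNT(*) FROM user GROUP BY -tag ORDER BY 1")
by_base = rows("SELECT -tag, COUNT(*) FROM user GROUP BY tag ORDER BY 1")

# abs() is many-to-one: 20 and -20 merge under GROUP BY abs(tag) but stay
# separate under GROUP BY tag.
abs_by_expr = rows("SELECT abs(tag), COUNT(*) FROM user GROUP BY abs(tag) ORDER BY 1")
abs_by_base = rows("SELECT abs(tag), COUNT(*) FROM user GROUP BY tag ORDER BY 1, 2")

print("negation, grouped by expression:", by_expr)
print("negation, grouped by base:      ", by_base)
print("abs, grouped by expression:     ", abs_by_expr)
print("abs, grouped by base:           ", abs_by_base)
```

For expressions of the second kind, this step is not a drop-in rewrite, and one of the other mitigations is the better fit.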

Step 5: Update SQLite or Backport the Query Planner Fix

  • Upgrade to a SQLite release newer than the affected 3.41.0–3.41.2 range, and confirm in the official release log that it includes the fix.
  • If upgrading is impossible, backport the fix into a custom build by patching the query planner’s handling of expression indexes in GROUP BY contexts. The core of the fix is a check that keeps expressions stored in an index from being treated as standalone columns during subquery flattening and join optimization.

Step 6: Use Index Hints to Bypass the Faulty Optimization

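  • Force the planner to ignore the indexes on user with NOT INDEXED:
    SELECT u.tag, v.max_value
    FROM (SELECT tag FROM user NOT INDEXED GROUP BY -tag) u
    JOIN (...) v ON ...;

    INDEXED BY cannot be used for this purpose, because naming an index that does not exist makes the statement fail to prepare. NOT INDEXED is a blunt instrument: it disables every index on user for that table reference, so it fits best when the expression index is the only index the subquery could benefit from.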
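The behavior of the two hints can be checked directly. This sketch, built on a hypothetical demo table, inspects the plan produced with NOT INDEXED and records the error SQLite raises when INDEXED BY names an index that does not exist (no_such_index is a deliberately missing, hypothetical name).

```python
import sqlite3


def plan_details(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Hypothetical demo table; the index name mirrors the one used in this guide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user(id INTEGER PRIMARY KEY, tag INTEGER);
INSERT INTO user(tag) VALUES (10), (20), (30);
CREATE INDEX "user(-tag)" ON user(-tag);
""")

# NOT INDEXED keeps the planner away from every index on this table reference.
hinted_plan = plan_details(conn, "SELECT tag FROM user NOT INDEXED GROUP BY -tag")
hinted_rows = conn.execute(
    "SELECT tag FROM user NOT INDEXED GROUP BY -tag ORDER BY tag"
).fetchall()

# INDEXED BY with a missing index fails when the statement is prepared.
try:
    conn.execute("SELECT tag FROM user INDEXED BY no_such_index GROUP BY -tag")
    missing_index_error = None
except sqlite3.OperationalError as exc:
    missing_index_error = str(exc)

print("plan with NOT INDEXED:", hinted_plan)
print("rows with NOT INDEXED:", hinted_rows)
print("INDEXED BY missing index ->", missing_index_error)
```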

Final Solution: Expression Index Design and Query Pattern Audits

  • Audit every expression index whose expression also appears in a join, GROUP BY clause, or aggregate query. Avoid indexing such expressions unless you have verified that the affected queries return the same rows with and without the index.
  • Rewrite queries so that each use of an indexed expression is confined to its own subquery, which keeps the planner from conflating indexed values with base column relationships.
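One way to run such an audit is to compare each query against a copy of the database that has had its expression indexes removed. The sketch below is a minimal illustration under stated assumptions: all helper names are hypothetical, the copy is made in memory with Connection.backup (so it suits small databases only), and rows are compared without regard to order.

```python
import sqlite3


def quote_ident(name):
    """Quote an identifier for use in a DDL statement."""
    return '"' + name.replace('"', '""') + '"'


def copy_without_expression_indexes(conn):
    """Return an in-memory copy of conn with its expression indexes dropped."""
    copy = sqlite3.connect(":memory:")
    conn.backup(copy)
    names = [row[0] for row in copy.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND sql IS NOT NULL"
    )]
    for name in names:
        # In PRAGMA index_xinfo output, cid = -2 marks an expression term.
        cids = [row[0] for row in copy.execute(
            "SELECT cid FROM pragma_index_xinfo(?)", (name,))]
        if -2 in cids:
            copy.execute("DROP INDEX " + quote_ident(name))
    return copy


def audit(conn, queries):
    """Return the queries whose rows differ once expression indexes are gone."""
    copy = copy_without_expression_indexes(conn)
    try:
        suspicious = []
        for sql in queries:
            original = sorted(conn.execute(sql).fetchall(), key=repr)
            stripped = sorted(copy.execute(sql).fetchall(), key=repr)
            if original != stripped:
                suspicious.append(sql)
        return suspicious
    finally:
        copy.close()


# Hypothetical demo database and query list.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user(id INTEGER PRIMARY KEY, tag INTEGER);
INSERT INTO user(tag) VALUES (10), (20), (30);
CREATE INDEX user_tag ON user(tag);
CREATE INDEX "user(-tag)" ON user(-tag);
""")

print("queries that change without expression indexes:",
      audit(conn, ["SELECT tag FROM user GROUP BY -tag"]))
```

An empty list means no query in the sample changed its answer. It does not prove that other queries are unaffected, so the list passed to audit should cover the joins and aggregations the application actually runs.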

By systematically addressing the interplay between expression indexes, GROUP BY logic, and join conditions, developers can resolve this class of bugs and prevent regressions in future SQLite deployments.
