Unexpected GROUP BY Behavior with Covering Indexes in Ambiguous SQLite Queries


1. Core Mechanics of Ambiguous GROUP BY Results in Multi-Table Joins

The problem centers around non-deterministic row selection during GROUP BY operations when joining tables with covering indexes. This manifests as inconsistent query outputs depending on whether an index is used to resolve the join.

Schema & Data Context

Two tables are involved:

  • t0: Contains a single column c0 with a text value ('DEADBEEF').
  • t1: Has two columns:
    • c0 (UNIQUE constraint, with values 'a' and '1000000').
    • c1 (contains 1 and NULL).

The Cartesian join between t1 and t0 produces two rows:

t1.c0 | t1.c1 | t0.c0  
----------------------  
a     | 1     | DEADBEEF  
1000000 | NULL | DEADBEEF  

The GROUP BY t0.c0 collapses these two rows into one, but SQLite must arbitrarily select either of the two rows. The HAVING t1.c0 clause then filters the result based on the truthiness of t1.c0 from the selected row.

Observed Behavior

  • With covering index (autoindex on t1.c0): The query returns 'DEADBEEF'.
  • Without index (NOT INDEXED): The query returns an empty set.

This discrepancy arises because the index scan order vs. table scan order changes which row from the join is retained during grouping.

Critical Factors

  1. Implicit Join Order: The sequence of table scans (t0 first vs. t1 first) affects row processing order.
  2. Covering Index Optimization: Indexes allow direct retrieval of t1.c0 values without accessing the base table, altering the effective row order.
  3. GROUP BY Ambiguity: SQL standards permit databases to discard non-aggregated/non-grouped columns arbitrarily when collapsing rows.

2. Root Causes of Non-Deterministic GROUP BY Outcomes

1. Absence of Deterministic Row Selection in GROUP BY

When multiple rows qualify for a single group, SQLite (and most SQL engines) do not guarantee which row’s non-aggregated columns are retained. This is not a bug but an intentional allowance in the SQL standard.

Example:

  • If a group has two rows with t0.c0 = 'DEADBEEF' and t1.c0 = 'a' or '1000000', SQLite may pick either t1.c0 value.

2. Index-Driven Query Plan Variations

Covering indexes alter the physical execution order of joins and row retrieval:

  • Index Scan: Retrieves t1.c0 values in index order (e.g., ascending sorted keys).
  • Table Scan: Retrieves t1 rows in physical storage order (often insertion order).

This difference changes which t1.c0 value is first encountered during grouping, influencing which row is retained.

3. HAVING Clause Dependency on Unstable Column Values

The HAVING t1.c0 condition introduces a post-grouping filter that depends on the retained row’s t1.c0 value. If the retained row has t1.c0 = 'a', the filter passes (non-zero/non-NULL). If t1.c0 = '1000000', the filter fails (since '1000000' is treated as a non-zero text value, but the row’s t1.c1 is NULL, which might not directly affect t1.c0’s truthiness).

Key Insight: The HAVING clause amplifies the ambiguity of the GROUP BY by making the query result hinge on an unstable column value.


3. Resolving Ambiguity: Strategies for Consistent Query Results

Step 1: Diagnose the Query Plan

Use EXPLAIN QUERY PLAN to compare index usage and join order:

EXPLAIN QUERY PLAN  
SELECT t0.c0 FROM t1, t0 GROUP BY t0.c0 HAVING t1.c0;  

Look for USING COVERING INDEX clauses. If present, the index is altering row retrieval order.

Step 2: Eliminate GROUP BY Ambiguity

Force deterministic row selection with MIN()/MAX() or subqueries:

SELECT t0.c0  
FROM (  
  SELECT t1.c0, t0.c0  
  FROM t1, t0  
  ORDER BY t1.c0  -- Explicit order ensures consistency  
  LIMIT 1  
)  
GROUP BY t0.c0  
HAVING t1.c0;  

Step 3: Avoid Implicit Joins with Unstable Columns

Rewrite the query to use explicit joins and filter criteria before grouping:

SELECT t0.c0  
FROM t0  
LEFT JOIN t1 ON true  -- Explicit cross join  
WHERE t1.c0 IS NOT NULL  
GROUP BY t0.c0;  

Step 4: Materialize Intermediate Results

Use a temporary table or CTE to stabilize the join result before grouping:

WITH JoinedData AS (  
  SELECT t1.c0 AS t1_c0, t0.c0 AS t0_c0  
  FROM t1, t0  
  ORDER BY t1.c0  -- Force index-like ordering  
)  
SELECT t0_c0  
FROM JoinedData  
GROUP BY t0_c0  
HAVING t1_c0;  

Step 5: Schema Adjustments for Determinism

Add columns to t1 to break ties during grouping:

ALTER TABLE t1 ADD COLUMN row_priority INTEGER DEFAULT 1;  
-- Set row_priority to control which row is retained  
SELECT t0.c0  
FROM t1, t0  
GROUP BY t0.c0  
HAVING t1.c0  
ORDER BY t1.row_priority  
LIMIT 1;  

Step 6: Disable Indexes for Predictable Scans

Force table scans using NOT INDEXED hints, but note that this sacrifices performance:

SELECT t0.c0 FROM t1 NOT INDEXED, t0 GROUP BY t0.c0 HAVING t1.c0;  

Step 7: Use Aggregates in HAVING Clauses

Reference aggregated values instead of raw columns:

SELECT t0.c0  
FROM t1, t0  
GROUP BY t0.c0  
HAVING MAX(t1.c0) IS NOT NULL;  -- Aggregates resolve ambiguity  

Final Recommendation

The root issue stems from relying on non-deterministic column values after grouping. To guarantee consistent results:

  • Use explicit ORDER BY before grouping.
  • Avoid HAVING clauses that depend on non-aggregated columns.
  • Prefer WHERE filters over HAVING when possible.
  • Design schemas with tiebreaker columns for critical groupings.

By formalizing row selection logic and isolating index-driven optimizations, developers can reconcile the differences between indexed and non-indexed query execution paths.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *