Unexpected GROUP BY Behavior with Covering Indexes in Ambiguous SQLite Queries
1. Core Mechanics of Ambiguous GROUP BY Results in Multi-Table Joins
The problem centers around non-deterministic row selection during GROUP BY operations when joining tables with covering indexes. This manifests as inconsistent query outputs depending on whether an index is used to resolve the join.
Schema & Data Context
Two tables are involved:
t0
: Contains a single columnc0
with a text value ('DEADBEEF'
).t1
: Has two columns:c0
(UNIQUE constraint, with values'a'
and'1000000'
).c1
(contains1
andNULL
).
The Cartesian join between t1
and t0
produces two rows:
t1.c0 | t1.c1 | t0.c0
----------------------
a | 1 | DEADBEEF
1000000 | NULL | DEADBEEF
The GROUP BY t0.c0
collapses these two rows into one, but SQLite must arbitrarily select either of the two rows. The HAVING t1.c0
clause then filters the result based on the truthiness of t1.c0
from the selected row.
Observed Behavior
- With covering index (autoindex on
t1.c0
): The query returns'DEADBEEF'
. - Without index (
NOT INDEXED
): The query returns an empty set.
This discrepancy arises because the index scan order vs. table scan order changes which row from the join is retained during grouping.
Critical Factors
- Implicit Join Order: The sequence of table scans (
t0
first vs.t1
first) affects row processing order. - Covering Index Optimization: Indexes allow direct retrieval of
t1.c0
values without accessing the base table, altering the effective row order. - GROUP BY Ambiguity: SQL standards permit databases to discard non-aggregated/non-grouped columns arbitrarily when collapsing rows.
2. Root Causes of Non-Deterministic GROUP BY Outcomes
1. Absence of Deterministic Row Selection in GROUP BY
When multiple rows qualify for a single group, SQLite (and most SQL engines) do not guarantee which row’s non-aggregated columns are retained. This is not a bug but an intentional allowance in the SQL standard.
Example:
- If a group has two rows with
t0.c0 = 'DEADBEEF'
andt1.c0 = 'a'
or'1000000'
, SQLite may pick eithert1.c0
value.
2. Index-Driven Query Plan Variations
Covering indexes alter the physical execution order of joins and row retrieval:
- Index Scan: Retrieves
t1.c0
values in index order (e.g., ascending sorted keys). - Table Scan: Retrieves
t1
rows in physical storage order (often insertion order).
This difference changes which t1.c0
value is first encountered during grouping, influencing which row is retained.
3. HAVING Clause Dependency on Unstable Column Values
The HAVING t1.c0
condition introduces a post-grouping filter that depends on the retained row’s t1.c0
value. If the retained row has t1.c0 = 'a'
, the filter passes (non-zero/non-NULL). If t1.c0 = '1000000'
, the filter fails (since '1000000'
is treated as a non-zero text value, but the row’s t1.c1
is NULL
, which might not directly affect t1.c0
’s truthiness).
Key Insight: The HAVING clause amplifies the ambiguity of the GROUP BY by making the query result hinge on an unstable column value.
3. Resolving Ambiguity: Strategies for Consistent Query Results
Step 1: Diagnose the Query Plan
Use EXPLAIN QUERY PLAN
to compare index usage and join order:
EXPLAIN QUERY PLAN
SELECT t0.c0 FROM t1, t0 GROUP BY t0.c0 HAVING t1.c0;
Look for USING COVERING INDEX
clauses. If present, the index is altering row retrieval order.
Step 2: Eliminate GROUP BY Ambiguity
Force deterministic row selection with MIN()
/MAX()
or subqueries:
SELECT t0.c0
FROM (
SELECT t1.c0, t0.c0
FROM t1, t0
ORDER BY t1.c0 -- Explicit order ensures consistency
LIMIT 1
)
GROUP BY t0.c0
HAVING t1.c0;
Step 3: Avoid Implicit Joins with Unstable Columns
Rewrite the query to use explicit joins and filter criteria before grouping:
SELECT t0.c0
FROM t0
LEFT JOIN t1 ON true -- Explicit cross join
WHERE t1.c0 IS NOT NULL
GROUP BY t0.c0;
Step 4: Materialize Intermediate Results
Use a temporary table or CTE to stabilize the join result before grouping:
WITH JoinedData AS (
SELECT t1.c0 AS t1_c0, t0.c0 AS t0_c0
FROM t1, t0
ORDER BY t1.c0 -- Force index-like ordering
)
SELECT t0_c0
FROM JoinedData
GROUP BY t0_c0
HAVING t1_c0;
Step 5: Schema Adjustments for Determinism
Add columns to t1
to break ties during grouping:
ALTER TABLE t1 ADD COLUMN row_priority INTEGER DEFAULT 1;
-- Set row_priority to control which row is retained
SELECT t0.c0
FROM t1, t0
GROUP BY t0.c0
HAVING t1.c0
ORDER BY t1.row_priority
LIMIT 1;
Step 6: Disable Indexes for Predictable Scans
Force table scans using NOT INDEXED
hints, but note that this sacrifices performance:
SELECT t0.c0 FROM t1 NOT INDEXED, t0 GROUP BY t0.c0 HAVING t1.c0;
Step 7: Use Aggregates in HAVING Clauses
Reference aggregated values instead of raw columns:
SELECT t0.c0
FROM t1, t0
GROUP BY t0.c0
HAVING MAX(t1.c0) IS NOT NULL; -- Aggregates resolve ambiguity
Final Recommendation
The root issue stems from relying on non-deterministic column values after grouping. To guarantee consistent results:
- Use explicit
ORDER BY
before grouping. - Avoid HAVING clauses that depend on non-aggregated columns.
- Prefer
WHERE
filters overHAVING
when possible. - Design schemas with tiebreaker columns for critical groupings.
By formalizing row selection logic and isolating index-driven optimizations, developers can reconcile the differences between indexed and non-indexed query execution paths.