Incorrect COUNT Results with DISTINCT and UNION ALL in SQLite
Issue Overview: DISTINCT Subquery Combined with UNION ALL Yields Unexpected Row Counts
The core issue arises when combining a SELECT DISTINCT
subquery with a UNION ALL
operation in SQLite versions 3.42.0 through 3.43.0. This combination can produce incorrect row counts when using aggregate functions like COUNT(1)
or count()
on the result set. The problem is particularly evident in queries where one branch of a UNION ALL
includes a DISTINCT
modifier while the other branch does not. For example:
SELECT COUNT(1) FROM (
SELECT DISTINCT * FROM t1
UNION ALL
SELECT * FROM t2
);
In this scenario, the outer COUNT(1)
returns an incorrect value (e.g., 2 instead of 3) even though the individual rows returned by the subquery are correct. The discrepancy occurs because SQLite’s query planner applies an optimization that inadvertently skips the computation of unused columns in the non-DISTINCT
branch of the UNION ALL
. This optimization, introduced in SQLite 3.42.0, was designed to improve performance by omitting unnecessary columns from subqueries. However, it fails to account for cases where a UNION ALL
combines DISTINCT
and non-DISTINCT
subqueries, leading to incorrect results in aggregate calculations.
The issue is rooted in the interaction between two components of SQLite’s query planner:
- The
DISTINCT
keyword, which ensures that all columns in the result set are used to determine uniqueness. - The
UNION ALL
operator, which concatenates results from two subqueries without removing duplicates.
When these are combined, SQLite’s optimizer incorrectly assumes that columns in the non-DISTINCT
branch of the UNION ALL
are not required for the outer query’s computation. This assumption breaks the integrity of the result set when counting rows or performing other aggregate operations.
Possible Causes: Optimization Conflict in UNION ALL with Mixed DISTINCT and Non-DISTINCT Subqueries
1. Omit-Unused-Subquery-Columns Optimization
Introduced in SQLite 3.42.0 (check-in 7c2d3406000dc8a), this optimization aims to reduce computational overhead by skipping the evaluation of columns in subqueries that are not referenced in the outer query. For example:
SELECT a FROM (SELECT a, b, c FROM t1);
Here, columns b
and c
are omitted from the subquery since they are unused. However, this optimization is invalid if the subquery includes DISTINCT
, as all columns are required to enforce uniqueness. The query planner includes a guard clause to disable the optimization for DISTINCT
subqueries.
The Flaw: The guard clause fails to account for UNION ALL
subqueries where only one branch is DISTINCT
. For instance:
SELECT COUNT(1) FROM (
SELECT DISTINCT * FROM t1 -- DISTINCT branch
UNION ALL
SELECT * FROM t2 -- Non-DISTINCT branch
);
The optimizer incorrectly applies column omission to the non-DISTINCT
branch (assuming its columns are unused), even though the DISTINCT
branch requires all columns. This results in a mismatch between the actual rows and the computed count.
2. Incorrect Handling of Column Dependencies in UNION ALL
UNION ALL
combines results from two subqueries, requiring both branches to have the same number of columns. The optimizer’s column omission logic violates this requirement when applied asymmetrically to the branches. In the example above, the DISTINCT
branch retains all columns (as required), while the non-DISTINCT
branch drops columns. This creates an internal inconsistency in the result set’s structure, leading to incorrect counts.
3. Version-Specific Regression
The issue is absent in SQLite versions prior to 3.42.0. The regression was caused by the introduction of the omit-unused-subquery-columns optimization, which was not fully tested against mixed DISTINCT
/non-DISTINCT
UNION ALL
subqueries. The optimization works correctly in simpler cases but fails in this edge case.
Troubleshooting Steps, Solutions & Fixes: Diagnosing and Resolving Incorrect COUNT with UNION ALL and DISTINCT
Step 1: Confirm the SQLite Version
Check the SQLite version using SELECT sqlite_version();
. If the version is 3.42.0, 3.43.0, or any version built after check-in 7c2d3406000dc8a and before c84d5602ac9bfb4f, the issue is likely present. Versions 3.41.2 and earlier are unaffected.
Step 2: Reproduce the Issue with a Minimal Test Case
Create a minimal test case to isolate the problem:
CREATE TABLE t1(id TEXT);
INSERT INTO t1 VALUES('a'), ('b');
CREATE TABLE t2(id TEXT);
INSERT INTO t2 VALUES('c');
-- Problematic query
SELECT COUNT(1) FROM (
SELECT DISTINCT * FROM t1
UNION ALL
SELECT * FROM t2
);
Expected Result: 3 (rows: ‘a’, ‘b’, ‘c’)
Actual Result: 2
If the result is incorrect, proceed to the next steps.
Step 3: Apply the Official Patch or Upgrade SQLite
The fix was implemented in check-in c84d5602ac9bfb4f. To resolve the issue:
- Upgrade SQLite to a version that includes the fix (3.43.1 or later).
- Recompile SQLite with the patched source code if upgrading is not feasible.
Step 4: Workarounds for Unpatched Versions
If upgrading is not possible, modify the query to avoid triggering the faulty optimization:
Workaround 1: Add a Dummy Column to the Non-DISTINCT Branch
Force the optimizer to retain all columns by referencing them in the outer query:
SELECT COUNT(1) FROM (
SELECT DISTINCT * FROM t1
UNION ALL
SELECT id, NULL AS dummy FROM t2 -- Add unused column
);
Workaround 2: Use UNION Instead of UNION ALL
Replace UNION ALL
with UNION
to disable the optimization (note that this removes duplicates across branches):
SELECT COUNT(1) FROM (
SELECT DISTINCT * FROM t1
UNION
SELECT * FROM t2
);
Workaround 3: Apply DISTINCT to Both Branches
Ensure both branches of the UNION ALL
are DISTINCT
:
SELECT COUNT(1) FROM (
SELECT DISTINCT * FROM t1
UNION ALL
SELECT DISTINCT * FROM t2
);
Step 5: Verify the Fix
After applying the patch or workaround, re-run the test case to confirm the correct count:
SELECT COUNT(1) FROM (
SELECT DISTINCT * FROM t1
UNION ALL
SELECT * FROM t2
);
Expected Result: 3
Step 6: Review Existing Queries for Similar Patterns
Audit existing SQL queries for combinations of DISTINCT
and UNION ALL
where one branch is DISTINCT
and the other is not. Update these queries using the workarounds above if necessary.
Step 7: Monitor Future SQLite Releases
Subscribe to SQLite’s changelog to stay informed about fixes and regressions. Ensure that future upgrades include thorough testing of edge cases involving query optimizations.
By following these steps, developers can resolve the incorrect row count issue and prevent similar problems in the future. The root cause—over-optimization in mixed DISTINCT
/non-DISTINCT
UNION ALL
subqueries—highlights the importance of rigorous testing when applying query planner optimizations.