Incorrect COUNT Results with DISTINCT and UNION ALL in SQLite

Issue Overview: DISTINCT Subquery Combined with UNION ALL Yields Unexpected Row Counts

The core issue arises when combining a SELECT DISTINCT subquery with a UNION ALL operation in SQLite versions 3.42.0 through 3.43.0. This combination can produce incorrect row counts when using aggregate functions like COUNT(1) or count() on the result set. The problem is particularly evident in queries where one branch of a UNION ALL includes a DISTINCT modifier while the other branch does not. For example:

SELECT COUNT(1) FROM (
  SELECT DISTINCT * FROM t1 
  UNION ALL 
  SELECT * FROM t2
);

In this scenario, the outer COUNT(1) returns an incorrect value (e.g., 2 instead of 3) even though the individual rows returned by the subquery are correct. The discrepancy occurs because SQLite’s query planner applies an optimization that inadvertently skips the computation of unused columns in the non-DISTINCT branch of the UNION ALL. This optimization, introduced in SQLite 3.42.0, was designed to improve performance by omitting unnecessary columns from subqueries. However, it fails to account for cases where a UNION ALL combines DISTINCT and non-DISTINCT subqueries, leading to incorrect results in aggregate calculations.

The issue is rooted in the interaction between two components of SQLite’s query planner:

  1. The DISTINCT keyword, which ensures that all columns in the result set are used to determine uniqueness.
  2. The UNION ALL operator, which concatenates results from two subqueries without removing duplicates.

When these are combined, SQLite’s optimizer incorrectly assumes that columns in the non-DISTINCT branch of the UNION ALL are not required for the outer query’s computation. This assumption breaks the integrity of the result set when counting rows or performing other aggregate operations.

Possible Causes: Optimization Conflict in UNION ALL with Mixed DISTINCT and Non-DISTINCT Subqueries

1. Omit-Unused-Subquery-Columns Optimization

Introduced in SQLite 3.42.0 (check-in 7c2d3406000dc8a), this optimization aims to reduce computational overhead by skipping the evaluation of columns in subqueries that are not referenced in the outer query. For example:

SELECT a FROM (SELECT a, b, c FROM t1);

Here, columns b and c are omitted from the subquery since they are unused. However, this optimization is invalid if the subquery includes DISTINCT, as all columns are required to enforce uniqueness. The query planner includes a guard clause to disable the optimization for DISTINCT subqueries.

The Flaw: The guard clause fails to account for UNION ALL subqueries where only one branch is DISTINCT. For instance:

SELECT COUNT(1) FROM (
  SELECT DISTINCT * FROM t1  -- DISTINCT branch
  UNION ALL
  SELECT * FROM t2           -- Non-DISTINCT branch
);

The optimizer incorrectly applies column omission to the non-DISTINCT branch (assuming its columns are unused), even though the DISTINCT branch requires all columns. This results in a mismatch between the actual rows and the computed count.

2. Incorrect Handling of Column Dependencies in UNION ALL

UNION ALL combines results from two subqueries, requiring both branches to have the same number of columns. The optimizer’s column omission logic violates this requirement when applied asymmetrically to the branches. In the example above, the DISTINCT branch retains all columns (as required), while the non-DISTINCT branch drops columns. This creates an internal inconsistency in the result set’s structure, leading to incorrect counts.

3. Version-Specific Regression

The issue is absent in SQLite versions prior to 3.42.0. The regression was caused by the introduction of the omit-unused-subquery-columns optimization, which was not fully tested against mixed DISTINCT/non-DISTINCT UNION ALL subqueries. The optimization works correctly in simpler cases but fails in this edge case.

Troubleshooting Steps, Solutions & Fixes: Diagnosing and Resolving Incorrect COUNT with UNION ALL and DISTINCT

Step 1: Confirm the SQLite Version

Check the SQLite version using SELECT sqlite_version();. If the version is 3.42.0, 3.43.0, or any version built after check-in 7c2d3406000dc8a and before c84d5602ac9bfb4f, the issue is likely present. Versions 3.41.2 and earlier are unaffected.

Step 2: Reproduce the Issue with a Minimal Test Case

Create a minimal test case to isolate the problem:

CREATE TABLE t1(id TEXT);
INSERT INTO t1 VALUES('a'), ('b');
CREATE TABLE t2(id TEXT);
INSERT INTO t2 VALUES('c');

-- Problematic query
SELECT COUNT(1) FROM (
  SELECT DISTINCT * FROM t1 
  UNION ALL 
  SELECT * FROM t2
);

Expected Result: 3 (rows: ‘a’, ‘b’, ‘c’)
Actual Result: 2

If the result is incorrect, proceed to the next steps.

Step 3: Apply the Official Patch or Upgrade SQLite

The fix was implemented in check-in c84d5602ac9bfb4f. To resolve the issue:

  1. Upgrade SQLite to a version that includes the fix (3.43.1 or later).
  2. Recompile SQLite with the patched source code if upgrading is not feasible.

Step 4: Workarounds for Unpatched Versions

If upgrading is not possible, modify the query to avoid triggering the faulty optimization:

Workaround 1: Add a Dummy Column to the Non-DISTINCT Branch

Force the optimizer to retain all columns by referencing them in the outer query:

SELECT COUNT(1) FROM (
  SELECT DISTINCT * FROM t1 
  UNION ALL 
  SELECT id, NULL AS dummy FROM t2  -- Add unused column
);

Workaround 2: Use UNION Instead of UNION ALL

Replace UNION ALL with UNION to disable the optimization (note that this removes duplicates across branches):

SELECT COUNT(1) FROM (
  SELECT DISTINCT * FROM t1 
  UNION 
  SELECT * FROM t2
);

Workaround 3: Apply DISTINCT to Both Branches

Ensure both branches of the UNION ALL are DISTINCT:

SELECT COUNT(1) FROM (
  SELECT DISTINCT * FROM t1 
  UNION ALL 
  SELECT DISTINCT * FROM t2
);

Step 5: Verify the Fix

After applying the patch or workaround, re-run the test case to confirm the correct count:

SELECT COUNT(1) FROM (
  SELECT DISTINCT * FROM t1 
  UNION ALL 
  SELECT * FROM t2
);

Expected Result: 3

Step 6: Review Existing Queries for Similar Patterns

Audit existing SQL queries for combinations of DISTINCT and UNION ALL where one branch is DISTINCT and the other is not. Update these queries using the workarounds above if necessary.

Step 7: Monitor Future SQLite Releases

Subscribe to SQLite’s changelog to stay informed about fixes and regressions. Ensure that future upgrades include thorough testing of edge cases involving query optimizations.

By following these steps, developers can resolve the incorrect row count issue and prevent similar problems in the future. The root cause—over-optimization in mixed DISTINCT/non-DISTINCT UNION ALL subqueries—highlights the importance of rigorous testing when applying query planner optimizations.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *