Unexpected Empty Result When Using ANALYZE with Complex Joins and Indexes

Unexpected Behavior in LEFT OUTER JOIN Results After ANALYZE Execution

Issue Overview

The core problem revolves around a SELECT query that returns an empty result set instead of a NULL value when executed under specific conditions involving table statistics collection via the ANALYZE command. The scenario involves three tables (t0, t1, t2) with unconventional indexes, data types, and JOIN conditions. The query structure includes a chain of LEFT OUTER JOIN operations and a WHERE clause that filters on a column from the rightmost table (t2.c0).

Key Observations

  1. Schema and Data Setup

    • t0 is a temporary table with a TEXT column c0 containing the value 'DM'.
    • t1 is a table with an INT column c0 populated with mixed-type values (strings, integers, NULL).
    • t2 is a table with a REAL column c0 initialized to a BLOB value (x''), later updated to NULL.
    • Indexes i54 (on t0) and i20 (on t2) use complex expressions involving bitwise operations, BETWEEN clauses with invalid operands (e.g., x''), and COLLATE RTRIM.
  2. Query Structure
    The problematic query:

    SELECT ALL t1.c0  
    FROM t0  
    LEFT OUTER JOIN t1  
    LEFT OUTER JOIN t2 ON (t1.c0 NOT NULL) == (t2.c0 IN ())  
    WHERE t2.c0;  
    
    • The JOIN condition compares t1.c0 NOT NULL (a boolean expression) to t2.c0 IN () (an empty list, which always evaluates to FALSE).
    • The WHERE t2.c0 clause filters rows where t2.c0 is truthy (non-zero and non-NULL).
  3. Behavioral Discrepancy

    • Case 1 (Without ANALYZE): The query returns no rows because t2.c0 is NULL (due to the UPDATE statement), and the WHERE t2.c0 clause filters out all rows. However, the user expected a NULL result, likely due to misunderstanding how LEFT JOIN interacts with WHERE clauses.
    • Case 2 (With ANALYZE): After running ANALYZE t0 and ANALYZE t1, the query returns an empty result set. The user interpreted this as a bug, expecting NULL instead of an empty output.
  4. Version-Specific Fix
    The SQLite maintainer confirmed the issue was resolved in versions after 3.41.1 (specifically in trunk and branch-3.41). The root cause was a query planner bug triggered by stale or incorrect statistics collected via ANALYZE.


Root Causes Behind Incorrect Query Results Post-ANALYZE

  1. Query Planner Sensitivity to Table Statistics
    The ANALYZE command generates statistical metadata about table contents and index selectivity. These statistics influence the query planner’s decisions about join order, index usage, and predicate evaluation. In this case:

    • The WHERE clause (t2.c0) depends on t2 being part of the JOIN sequence.
    • After ANALYZE, outdated or misleading statistics caused the planner to incorrectly optimize the JOIN order or predicate evaluation, leading to an empty result.
  2. Index-Driven Optimization Pitfalls
    The indexes i54 and i20 involve non-deterministic or invalid expressions:

    • i54 includes ((c0) & (c0)) NOT BETWEEN ('?~') AND (x''), which combines bitwise operations with a BETWEEN clause comparing a TEXT value ('?~') to a BLOB (x''). This index is effectively unused due to its WHERE clause ((NULL) BETWEEN (c0) AND (c0)), which always evaluates to NULL (equivalent to FALSE).
    • i20 uses ((c0 OR 0.488...) OR c0) IN (), which simplifies to FALSE.

    These indexes confuse the query planner by introducing synthetic columns with hard-coded FALSE or NULL values, skewing statistics and causing incorrect index selections.

  3. NULL Handling in JOIN Conditions and WHERE Clauses

    • The LEFT OUTER JOIN t2 condition (t1.c0 NOT NULL) == (t2.c0 IN ()) resolves to t1.c0 NOT NULL == FALSE, or equivalently t1.c0 IS NULL.
    • The WHERE t2.c0 clause filters out all rows where t2.c0 is NULL or 0. Since t2.c0 was set to NULL via UPDATE, no rows satisfy this condition.
    • The expectation of a NULL result stemmed from misunderstanding how LEFT JOIN preserves unmatched rows from the left table but applies WHERE clause filtering after the JOIN.
  4. CLI Output Formatting Misinterpretation
    SQLite’s command-line interface (CLI) does not display NULL values as the string "NULL" but instead shows an empty string. This led to confusion between an actual NULL value (stored in the result set) and an empty output (no rows).


Resolving Empty Results by Addressing Query Plan and Version-Specific Bugs

Step 1: Verify SQLite Version and Apply Updates

  • Check Version:
    SELECT sqlite_version();  
    

    Ensure the version is 3.41.1 or newer. Versions prior to 3.41.1 contain the query planner bug fixed in later releases.

  • Update SQLite:
    Download the latest precompiled binaries or build from source:

    wget https://www.sqlite.org/src/tarball/sqlite.tar.gz  
    tar xzf sqlite.tar.gz  
    cd sqlite  
    ./configure && make  
    

Step 2: Analyze Query Logic and Data Flow

  • Simplify the Query:
    Break down the JOIN sequence and WHERE clause:

    -- Original JOIN logic rewritten for clarity  
    SELECT t1.c0  
    FROM t0  
    LEFT JOIN (  
      t1  
      LEFT JOIN t2 ON (t1.c0 IS NULL)  
    ) ON TRUE  
    WHERE t2.c0;  
    
    • The t2 JOIN condition simplifies to t1.c0 IS NULL.
    • The WHERE t2.c0 clause excludes rows where t2.c0 is NULL or 0.
  • Validate Data States:

    • Confirm t2.c0 is NULL after the UPDATE:
      SELECT c0 FROM t2;  
      
    • Check t1.c0 values for NULL:
      SELECT c0 FROM t1 WHERE c0 IS NULL;  
      

Step 3: Disable or Correct Indexes

  • Drop Problematic Indexes:

    DROP INDEX i54;  
    DROP INDEX i20;  
    

    Re-run the query to see if the result changes.

  • Rewrite Index Definitions:
    Avoid expressions with hard-coded FALSE or NULL conditions. For example:

    CREATE INDEX i20 ON t2(c0);  -- Simple index on t2.c0  
    

Step 4: Bypass ANALYZE-Induced Issues

  • Reset Statistics:
    Clear existing statistics to force the query planner to recalculate:

    ANALYZE sqlite_schema;  -- Updates internal schema metadata  
    
  • Use Query-Specific Optimizer Hints:
    Force a specific JOIN order using CROSS JOIN or INDEXED BY:

    SELECT ALL t1.c0  
    FROM t0  
    LEFT OUTER JOIN t1  
    LEFT OUTER JOIN t2 INDEXED BY i20 ON ...  
    

Step 5: Correct NULL Handling and Output Display

  • Explicitly Handle NULLs:
    Use COALESCE or IFNULL to distinguish NULL from empty strings:

    SELECT ALL COALESCE(t1.c0, 'NULL') AS c0  
    FROM ...  
    
  • Adjust CLI Settings:
    Configure the CLI to display NULL explicitly:

    .nullvalue NULL  
    

Step 6: Validate with EXPLAIN and EXPLAIN QUERY PLAN

  • Inspect Query Execution Steps:

    EXPLAIN QUERY PLAN  
    SELECT ALL t1.c0 FROM t0 LEFT OUTER JOIN t1 ...;  
    
    • Look for full table scans vs index usage.
    • Verify JOIN order matches expectations.
  • Compare Pre- and Post-ANALYZE Plans:
    Run EXPLAIN before and after ANALYZE to identify differences in opcode sequences.

Step 7: Test with Simplified Schema and Data

  • Reproduce with Minimal Components:
    Create a minimal test case:

    CREATE TABLE t1 (c0 INT);  
    INSERT INTO t1 VALUES (NULL);  
    CREATE TABLE t2 (c0 REAL);  
    INSERT INTO t2 VALUES (NULL);  
    SELECT t1.c0  
    FROM t1  
    LEFT JOIN t2 ON t1.c0 IS NULL  
    WHERE t2.c0;  
    
    • This isolates the JOIN logic and WHERE clause.

Step 8: Review SQLite’s NULL and Boolean Logic

  • Understand Implicit Type Conversions:

    • t2.c0 IN () evaluates to FALSE regardless of t2.c0.
    • t1.c0 NOT NULL evaluates to TRUE or FALSE, but in SQLite, TRUE is 1 and FALSE is 0.
    • The comparison (t1.c0 NOT NULL) == (t2.c0 IN ()) becomes 1 == 0 or 0 == 0, depending on t1.c0.
  • Adjust Boolean Comparisons:
    Use explicit IS TRUE or IS FALSE to avoid integer coercion:

    ON (t1.c0 NOT NULL) IS (t2.c0 IN ())  
    

Step 9: Consult SQLite’s Query Planner Documentation

Step 10: Report Edge Cases to SQLite Maintainers

  • Contribute to Testing:
    If the issue persists after all fixes, submit a reproducible test case to the SQLite team via their bug tracker.

This guide systematically addresses the interplay between query planner optimizations, index design, and NULL handling in SQLite. By methodically isolating each component, developers can diagnose similar issues arising from complex JOINs and statistical metadata.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *