Recursive Query BLOOM Filter Performance Degradation and UNION ALL View Index Utilization Regression

BLOOM Filter Overhead in Recursive CTE Queries and UNION ALL View Index Selection Failure

1. Core Mechanics of Recursive CTE Execution and View-Based Index Utilization

The problem manifests in two distinct but related phases involving SQLite’s query optimizer decisions. First, the recursive Common Table Expression (CTE) with BLOOM filter usage demonstrates severe performance degradation compared to its non-BLOOM-filtered counterpart. Second, after addressing the BLOOM filter issue, a regression occurs where a UNION ALL view fails to utilize available indexes on underlying tables. These phenomena stem from SQLite’s cost-based optimizer making suboptimal decisions about filter application and predicate pushing.

Recursive CTE Execution Dynamics

Recursive CTEs in SQLite operate through a two-phase process:

  1. Initial Setup Phase: Executes the base query (non-recursive term)
  2. Recursive Expansion Phase: Iteratively processes the recursive term until no new rows are added

In the provided query:

WITH RECURSIVE transit(x) AS (
  SELECT s FROM objs WHERE p=9 AND o=32805
  UNION 
  SELECT objs.s FROM objs, transit WHERE objs.p=9 AND objs.o=transit.x
)
SELECT x FROM transit;

The optimizer generates a BLOOM filter during the recursive step to reduce unnecessary row comparisons. A BLOOM filter acts as a probabilistic data structure that tests whether an element belongs to a set. When applied to objs.o=transit.x, it attempts to filter out objs rows early that cannot possibly join with transit.x values.

UNION ALL View Index Utilization Mechanism

The quads view combines two tables via UNION ALL:

CREATE VIEW quads AS 
  SELECT c,s,p,o,NULL AS d FROM objs 
  UNION ALL 
  SELECT c,s,p,o,d FROM datas;

When querying this view with predicates:

SELECT o,d FROM quads WHERE s=303 AND p=9;

SQLite should push down the s=303 AND p=9 predicates to both objs and datas tables, leveraging their respective indexes. Effective predicate pushing allows index seeks instead of full table scans. The regression observed after the BLOOM filter fix indicates a breakdown in this optimization pathway.

2. Root Causes of BLOOM Filter Inefficiency and Predicate Pushdown Failure

BLOOM Filter Overhead in Recursive Context

  1. Filter Construction Cost: BLOOM filters require hashing candidate values from the transit.x set. When the recursive step produces a large number of distinct x values, the filter construction becomes computationally expensive.
  2. False Positive Rate Impact: While BLOOM filters eliminate most non-matching rows, even a 1% false positive rate forces unnecessary row evaluations. In deep recursion trees, these accumulate exponentially.
  3. Index Coverage Mismatch: The index_objs_op on (o,p) provides covering data for the initial setup phase but becomes less effective during recursive expansion if transit.x values are sparsely distributed across the index.
  4. Recursion Depth vs. Filter Efficiency: BLOOM filters optimize for large candidate sets, but shallow recursion depths (as seen in the fast 0.001s execution) render them counterproductive. The optimizer’s cost model overestimates the benefit of BLOOM filtering for small intermediate result sets.

UNION ALL View Index Utilization Regression

  1. View Materialization Thresholds: Recent SQLite versions may alter internal heuristics for view flattening. If the optimizer decides to materialize the quads view before applying predicates, index usage becomes impossible.
  2. Predicate Pushdown Blockers: New constraints in the query planner might prevent pushing s=303 AND p=9 into the view’s constituent UNION ALL branches. Common causes include:
    • Type mismatches between view columns and base table columns
    • Loss of nullability information through the view
    • Overly conservative safety checks in predicate mobility analysis
  3. Cost Estimation Errors: The optimizer might incorrectly calculate that full scans of objs and datas are cheaper than index seeks due to:
    • Outdated table statistics (ANALYZE data)
    • Misconfigured SQLITE_STAT4 histogram bins
    • Incorrect assumptions about correlation between s and p columns

3. Comprehensive Optimization Strategies for Recursive BLOOM Filters and View Index Restoration

Addressing Recursive CTE BLOOM Filter Overhead

Step 1: Query Plan Analysis and Forced Optimization Control

  • Pre-fix Workaround: Disable BLOOM filters using undocumented testctrl:

    .testctrl optimizations 0x80000  -- CLI only
    

    For application code, recompile SQLite with -DSQLITE_OMIT_BLOOMFILTER (not recommended long-term).

  • Post-fix Verification: Confirm the trunk version properly handles BLOOM filter cost-benefit analysis in recursive contexts. Test with:

    EXPLAIN QUERY PLAN
    WITH RECURSIVE transit(x) AS (...);
    

    Ensure the optimized plan uses direct index seeks without BLOOM filters when appropriate.

Step 2: Index Tuning for Recursive Traversal

  • Composite Index Optimization: Create an index aligning with recursive join patterns:

    CREATE INDEX index_objs_po ON objs(p,o,s);  -- Covering index
    

    This allows both setup and recursive terms to use the same index, reducing planner ambiguity.

  • Recursion Depth Monitoring: Instrument the query with termination limits:

    WITH RECURSIVE transit(x, depth) AS (
      SELECT s, 1 FROM objs WHERE p=9 AND o=32805
      UNION 
      SELECT objs.s, depth+1 
      FROM objs, transit 
      WHERE objs.p=9 AND objs.o=transit.x AND depth < 20
    )
    SELECT x FROM transit;
    

    Prevents runaway recursion while providing depth context for optimizer decisions.

Step 3: Manual Query Transformation

  • BREADTH-FIRST vs DEPTH-FIRST Search: Force materialization order using ORDER BY:

    WITH RECURSIVE transit(x) AS (...)
    SELECT x FROM transit ORDER BY rowid;
    

    Alters row processing order, which can significantly impact BLOOM filter efficacy.

  • Temporary Materialization: Dump intermediate results to temp tables:

    CREATE TEMP TABLE temp_transit AS
    WITH RECURSIVE transit(x) AS (...);
    

    Breaks the recursion into planner-visible steps, allowing better cost estimation.

Restoring Index Utilization in UNION ALL Views

Step 1: View Definition Analysis

  • Explicit Column Typing: Ensure view columns match base table types:

    CREATE VIEW quads AS 
    SELECT CAST(c AS INTEGER) AS c, 
           CAST(s AS INTEGER) AS s,
           ... 
    FROM objs 
    UNION ALL 
    ...;
    

    Eliminates type mismatch barriers to predicate pushdown.

  • Predicate Forwarding: Use LATERAL joins (SQLite 3.35+) to force predicate visibility:

    SELECT q.o, q.d 
    FROM quads q 
    WHERE EXISTS (
      SELECT 1 
      FROM (SELECT * FROM objs WHERE s=303 AND p=9 UNION ALL SELECT * FROM datas WHERE s=303 AND p=9) 
      WHERE rowid=q.rowid
    );
    

Step 2: Index Visibility Enhancement

  • Compound Index Creation: Build indexes covering all filtered columns:

    CREATE INDEX index_objs_sp ON objs(s,p,o);
    CREATE INDEX index_datas_sp ON datas(s,p,o,d);
    

    Allows covering index scans without base table access.

  • Statistics Regeneration: Force fresh cost calculations:

    ANALYZE;
    ANALYZE sqlite_schema;  -- Update internal schema stats
    

Step 3: Query Planner Overrides

  • Materialization Bypass: Use NOT MATERIALIZED hint (SQLite 3.35+):

    SELECT o,d FROM quads NOT MATERIALIZED WHERE s=303 AND p=9;
    

    Directs the optimizer to process the view as a subquery rather than materializing.

  • Subquery Factoring: Manually decompose the view:

    SELECT o,d FROM (
      SELECT c,s,p,o,NULL AS d FROM objs WHERE s=303 AND p=9
      UNION ALL 
      SELECT c,s,p,o,d FROM datas WHERE s=303 AND p=9
    );
    

    Explicit predicate placement bypasses view optimization ambiguities.

Cross-Issue Mitigation Techniques

Index Merge Optimization: For complex query patterns, leverage INDEX_MERGE:

SELECT o,d FROM quads 
WHERE s=303 AND p=9
AND (rowid IN (SELECT rowid FROM objs WHERE s=303 AND p=9) 
     OR rowid IN (SELECT rowid FROM datas WHERE s=303 AND p=9));

Partial Indexing: Create filtered indexes matching common query parameters:

CREATE INDEX index_objs_high_freq ON objs(s,p,o) WHERE s BETWEEN 300 AND 400;

Cost Threshold Adjustments: Modify optimizer parameters via PRAGMA:

PRAGMA optimizer_cost_limit=1000;       -- Default 1000000
PRAGMA optimizer_scan_cost=2000;        -- Default 40
PRAGMA optimizer_index_cost=100;        -- Default 100

Query Plan Fixation: Use INDEXED BY clauses to override planner choices:

SELECT o,d FROM quads 
WHERE s=303 AND p=9
INDEXED BY index_objs_sp, index_datas_sp;

Long-Term Maintenance Practices

  1. Version-Specific Plan Analysis: Regularly compare EXPLAIN QUERY PLAN outputs across SQLite versions to detect optimization regressions early.

  2. Index Coverage Monitoring: Use diagnostic queries to identify uncovered access patterns:

    SELECT DISTINCT sql FROM sqlite_stat4 WHERE tbl IN ('objs','datas');
    
  3. Recursive Query Profiling: Instrument CTEs with timing counters:

    WITH RECURSIVE transit(x, iter) AS (
      SELECT s, 1 FROM objs WHERE p=9 AND o=32805
      UNION ALL
      SELECT objs.s, iter+1 
      FROM objs, transit 
      WHERE objs.p=9 AND objs.o=transit.x AND iter < 100
    )
    SELECT x, MAX(iter) FROM transit GROUP BY x;
    
  4. View Optimization Triggers: Create instead-of triggers to customize view access paths:

    CREATE TRIGGER quads_optimize INSTEAD OF SELECT ON quads
    BEGIN
      SELECT o,d FROM objs WHERE s=NEW.s AND p=NEW.p
      UNION ALL 
      SELECT o,d FROM datas WHERE s=NEW.s AND p=NEW.p;
    END;
    
  5. Plan Stability Preservation: Archive known-good query plans using sqlite_plan table:

    CREATE TABLE sqlite_plan(query TEXT, plan TEXT);
    INSERT INTO sqlite_plan 
    VALUES ('SELECT ...', (SELECT PLAN FROM EXPLAIN_QUERY_PLAN(...)));
    

This comprehensive approach addresses both immediate performance pathologies and establishes guardrails against future optimization regressions. The recursive query optimizations focus on aligning index structures with traversal patterns while the view fixes emphasize predicate mobility and cost model calibration. Together, they restore the performance characteristics expected from SQLite’s sophisticated query planner while accounting for edge-case behaviors in complex schema constructs.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *