Bloom Filter Usage Leads to Suboptimal Query Performance in SQLite Joins

Understanding the Performance Impact of Bloom Filters in Join Order Selection

Scenario: Join Order Selection with Bloom Filter vs. Indexed Filtering

The core issue revolves around a SQL query joining two large tables (vout and transactions) with a WHERE filter on transactions.block_id. The query planner selects a suboptimal execution plan involving a Bloom filter when joining the tables in a specific order, resulting in an execution time of roughly 18 seconds. Rewriting the query to force a different join order via CROSS JOIN reduces execution time to under 100 milliseconds. This discrepancy arises from how the query planner prioritizes index usage and filtering conditions. Key elements include:

  1. Table Schema & Indexes

    • transactions has an index on block_id (transactions_block_id_idx)
    • vout has two indexes: tx_id (vout_tx_id_idx) and addr_id+tx_id (vout_addr_tx_idx)
    • Foreign key relationships: vout.tx_id references transactions.id
  2. Query Variations

    • Original query:
      SELECT COUNT(DISTINCT addr_id)
      FROM vout
      JOIN transactions t ON t.id = vout.tx_id
      WHERE t.block_id < 10000;
      
    • Optimized query using CROSS JOIN:
      SELECT COUNT(DISTINCT addr_id)
      FROM transactions t
      CROSS JOIN vout ON t.id = vout.tx_id
      WHERE t.block_id < 10000;
      
  3. Execution Plan Differences

    • Original Plan:
      • Scans vout using vout_addr_tx_idx (covering index)
      • Applies Bloom filter on t.id
      • Searches transactions via primary key
    • Optimized Plan:
      • Uses transactions_block_id_idx to filter rows early
      • Joins with vout via vout_tx_id_idx
  4. Performance Metrics

    • Original query: 18.2 seconds (real time)
    • Optimized query: 0.04 seconds (real time)
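
The gap can be reproduced in miniature with Python's built-in sqlite3 module. This is a sketch: the schema and index names follow the article, but the data is synthetic and the constants are scaled down so both query forms run instantly:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE transactions (id INTEGER PRIMARY KEY, block_id INTEGER);
    CREATE TABLE vout (tx_id INTEGER, addr_id INTEGER);
    CREATE INDEX transactions_block_id_idx ON transactions(block_id);
    CREATE INDEX vout_tx_id_idx ON vout(tx_id);
    CREATE INDEX vout_addr_tx_idx ON vout(addr_id, tx_id);
""")
# Synthetic data: 1000 transactions spread over 100 blocks, 3 outputs each.
con.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(i, i % 100) for i in range(1000)])
con.executemany("INSERT INTO vout VALUES (?, ?)",
                [(i, i % 50) for i in range(1000) for _ in range(3)])
con.execute("ANALYZE")

original = """SELECT COUNT(DISTINCT addr_id) FROM vout
              JOIN transactions t ON t.id = vout.tx_id
              WHERE t.block_id < 10"""
forced = """SELECT COUNT(DISTINCT addr_id) FROM transactions t
            CROSS JOIN vout ON t.id = vout.tx_id
            WHERE t.block_id < 10"""

# Both forms return the same answer; only the plan may differ.
print(con.execute(original).fetchone()[0])
for row in con.execute("EXPLAIN QUERY PLAN " + forced):
    print(row[-1])
```

At this toy scale the planner usually picks the fast plan either way; the point is having a harness where EXPLAIN QUERY PLAN output for the two forms can be compared side by side.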

Root Causes of Suboptimal Bloom Filter Usage

1. Index Selectivity and Statistics Misleading the Query Planner

The sqlite_stat1 data provided to the query planner indicates:

  • transactions_block_id_idx: 25,342,624 entries, on average ~20 rows per distinct block_id.
  • vout_tx_id_idx: 84,748,018 entries, on average ~4 rows per distinct tx_id.
  • vout_addr_tx_idx: 93,370,966 entries, on average ~2 rows per distinct addr_id.

(In sqlite_stat1, the stat column is the index entry count followed by the average number of rows per distinct value of each leftmost-column prefix; a smaller average means a more selective column.) These figures tell the planner that vout_addr_tx_idx is extremely selective on addr_id (the target of COUNT(DISTINCT)), prompting it to drive the query from that index. However, vout_addr_tx_idx does not support the t.block_id < 10000 filter at all. The Bloom filter is deployed to narrow down t.id values during the join, but this approach fails to leverage transactions_block_id_idx for early filtering.
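
What those stat strings encode can be checked directly. A minimal sketch with Python's sqlite3 module and synthetic data (toy table and index names, not the article's schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(block_id INTEGER)")
con.execute("CREATE INDEX t_block_id_idx ON t(block_id)")
# 1000 rows: 50 distinct block_id values, exactly 20 rows each.
con.executemany("INSERT INTO t VALUES (?)",
                [(i // 20,) for i in range(1000)])
con.execute("ANALYZE")

# stat is "nEntry nAvgEq": total entries, then the average number of
# rows sharing one value of the leftmost column.
row = con.execute(
    "SELECT stat FROM sqlite_stat1 WHERE idx = 't_block_id_idx'").fetchone()
print(row[0])
```

With this data the stat string comes out as entry count 1000 and an average of 20 rows per distinct value, mirroring the "25342624 20" entry for transactions_block_id_idx above.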

2. Bloom Filter Overhead in Nested Loop Joins

Bloom filters optimize joins by probabilistically excluding non-matching rows early. However, their effectiveness depends on:

  • Filter placement: The filter is built over the transactions rows that satisfy the WHERE clause and then probed once per outer-loop row. With vout as the outer table, that means tens of millions of probes, and every false positive still pays for a B-tree seek into transactions.
  • Index alignment: The Bloom filter only screens primary-key lookups of t.id; it does nothing to exploit transactions_block_id_idx, which could have reduced the candidate set with a single range scan.

3. Join Order Decision Heuristics

SQLite’s query planner defaults to scanning the smaller or more selective table first. The vout_addr_tx_idx’s statistics mislead the planner into believing that scanning vout first is optimal. In reality, filtering transactions using block_id first drastically reduces the working set:

  • Only rows where block_id < 10000 are processed (≈0.04% of 25M rows).
  • This reduces the number of tx_id values joined with vout, minimizing index lookups.

4. Missing or Outdated Statistics

The sqlite_stat1 entries carry only per-index averages: there is no histogram for block_id and no correlation data between transactions.block_id and vout.tx_id. Without range statistics (available only via the optional sqlite_stat4 table), the planner cannot estimate how few rows block_id < 10000 actually selects, leading to incorrect cost assumptions.

Resolving Bloom Filter-Induced Performance Degradation

Step 1: Validate and Update Statistics

  1. Re-analyze Tables:

    ANALYZE;
    

    This regenerates sqlite_stat1 with fresh data. Ensure the ANALYZE command accounts for recent data distribution changes.

  2. Manual Statistics Adjustment:
    If automatic analysis is insufficient, sqlite_stat1 can be edited by hand. The stat column holds the index entry count followed by the average number of rows per distinct value in each leftmost-column prefix, so lowering the second number makes the index look more selective to the planner:

    UPDATE sqlite_stat1 
    SET stat = '25342624 10'  -- entry count, then avg rows per distinct block_id
    WHERE idx = 'transactions_block_id_idx';
    ANALYZE sqlite_master;    -- make the open connection reload the edited stats
    

    Note that sqlite_stat1 carries no histogram, so it cannot express the selectivity of a specific range such as block_id < 10000. For histogram-based range estimates, build SQLite with SQLITE_ENABLE_STAT4.
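
An end-to-end sketch of the manual adjustment, using Python's sqlite3 module and toy data (the `ANALYZE sqlite_master` step is what makes an already-open connection reload the edited statistics):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE transactions (id INTEGER PRIMARY KEY, block_id INTEGER);
    CREATE INDEX transactions_block_id_idx ON transactions(block_id);
""")
con.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(i, i % 20) for i in range(2000)])
con.execute("ANALYZE")  # populates sqlite_stat1 ('2000 100' here)

# Pretend block_id is far more selective than ANALYZE observed.
con.execute("""UPDATE sqlite_stat1 SET stat = '2000 2'
               WHERE idx = 'transactions_block_id_idx'""")
# Without this, the open connection keeps using the old statistics.
con.execute("ANALYZE sqlite_master")

plan = con.execute("""EXPLAIN QUERY PLAN
    SELECT id FROM transactions WHERE block_id < 5""").fetchall()
print(plan[0][-1])
```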

Step 2: Force Join Order with CROSS JOIN

Rewrite the query to enforce transactions as the driving table:

SELECT COUNT(DISTINCT addr_id)
FROM transactions t
CROSS JOIN vout ON t.id = vout.tx_id
WHERE t.block_id < 10000;

In SQLite, CROSS JOIN is more than a hint: the planner is documented never to reorder tables across a CROSS JOIN, so transactions is guaranteed to be the outer loop and transactions_block_id_idx can filter rows before the join.

Step 3: Disable Bloom Filters Temporarily

Use the SQLite CLI’s .testctrl command to disable Bloom filters for testing (0x80000 is the bitmask of the Bloom-filter optimization; dot-commands do not take SQL-style comments, so pass the mask on its own):

.testctrl optimizations 0x80000

Run the original query to compare performance. If Bloom filters were masking a worse plan, this reveals the true cost of the chosen index.

Step 4: Optimize Indexing Strategy

  1. Add a Covering Index on transactions:

    CREATE INDEX transactions_block_id_id_idx 
    ON transactions(block_id, id);
    

    This allows the planner to satisfy both the block_id filter and the id join key from a single index. Caveat: if id is the INTEGER PRIMARY KEY (i.e., the rowid), every SQLite index already stores it implicitly, so transactions_block_id_idx is itself covering for this query and the extra index is redundant.

  2. Modify vout Indexes for Better Joins:
    Replace vout_addr_tx_idx with an index that includes tx_id as the leading column:

    DROP INDEX vout_addr_tx_idx;
    CREATE INDEX vout_tx_id_addr_idx ON vout(tx_id, addr_id);
    

    This aligns with the join condition t.id = vout.tx_id, enabling index-only scans for addr_id.
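
Whether the reworked vout index actually enables index-only access is easy to verify: EXPLAIN QUERY PLAN reports "COVERING INDEX" when no table lookup is needed. A minimal sketch with Python's sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE vout (tx_id INTEGER, addr_id INTEGER)")
con.execute("CREATE INDEX vout_tx_id_addr_idx ON vout(tx_id, addr_id)")

# The join probe shape: look up addr_id values for one tx_id.
plan = con.execute("""EXPLAIN QUERY PLAN
    SELECT addr_id FROM vout WHERE tx_id = 42""").fetchall()
detail = plan[0][-1]
print(detail)
```

Because the index leads with tx_id and also stores addr_id, the probe never touches the vout table itself.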

Step 5: Use Subqueries or CTEs to Isolate Filtering

Break down the query to isolate the block_id filter:

WITH filtered_transactions AS (
  SELECT id FROM transactions WHERE block_id < 10000
)
SELECT COUNT(DISTINCT addr_id)
FROM vout
WHERE tx_id IN (SELECT id FROM filtered_transactions);

This structure encourages the planner to process the transactions filter first.
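
The CTE form and the plain join are semantically equivalent here, which is worth pinning down with a test before swapping one for the other. A sketch using Python's sqlite3 module with synthetic data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE transactions (id INTEGER PRIMARY KEY, block_id INTEGER);
    CREATE TABLE vout (tx_id INTEGER, addr_id INTEGER);
""")
con.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(i, i % 100) for i in range(500)])
con.executemany("INSERT INTO vout VALUES (?, ?)",
                [(i, i % 7) for i in range(500)])

cte = """WITH filtered_transactions AS (
             SELECT id FROM transactions WHERE block_id < 10
         )
         SELECT COUNT(DISTINCT addr_id) FROM vout
         WHERE tx_id IN (SELECT id FROM filtered_transactions)"""
join = """SELECT COUNT(DISTINCT addr_id) FROM vout
          JOIN transactions t ON t.id = vout.tx_id
          WHERE t.block_id < 10"""

# Same answer from both shapes; only the plan differs.
print(con.execute(cte).fetchone()[0], con.execute(join).fetchone()[0])
```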

Step 6: Keep Planner Statistics Fresh

SQLite does not expose pragmas for tuning its cost model directly; the supported levers are statistics-related:

  • PRAGMA optimize; — run periodically (for example, before closing long-lived connections) so SQLite re-analyzes tables whose statistics appear stale.
  • PRAGMA analysis_limit = N; — caps the number of rows each ANALYZE pass examines, making frequent re-analysis affordable on large tables.
  • SQLITE_ENABLE_STAT4 — a compile-time option that adds sampled histograms (sqlite_stat4), giving the planner real selectivity estimates for range predicates such as block_id < 10000.
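
The statistics-maintenance pragmas can be wired into application code; a sketch with Python's sqlite3 module (analysis_limit requires SQLite 3.28+, PRAGMA optimize 3.18+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(x INTEGER)")
con.execute("CREATE INDEX t_x_idx ON t(x)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

# Run a query so the planner has seen the index in use this session.
con.execute("SELECT x FROM t WHERE x = 5").fetchall()

# Cap how many rows each ANALYZE pass may examine (cheap on big tables),
# then let SQLite re-analyze whatever it considers stale.
con.execute("PRAGMA analysis_limit = 400")
con.execute("PRAGMA optimize")
```

Running `PRAGMA optimize` just before closing each long-lived connection is the pattern the SQLite documentation recommends.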

Step 7: Evaluate Query Plan Stability

Use EXPLAIN QUERY PLAN to verify the optimized strategy persists across database updates. If the plan regresses, consider:

  • SQLite Version Upgrades: Later versions may include improved heuristics.
  • Query-Specific Optimizer Hints: Use INDEXED BY to enforce index usage:
    SELECT COUNT(DISTINCT addr_id)
    FROM transactions t INDEXED BY transactions_block_id_idx
    JOIN vout ON t.id = vout.tx_id
    WHERE t.block_id < 10000;
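
Note that INDEXED BY is a requirement, not a hint: if the planner cannot use the named index, statement preparation fails (with a "no query solution" error) rather than silently regressing to another plan. A small sketch with Python's sqlite3 module and a toy copy of the schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, block_id INTEGER)")
con.execute(
    "CREATE INDEX transactions_block_id_idx ON transactions(block_id)")

# The named index supports the range predicate, so this prepares fine
# and the plan is pinned to transactions_block_id_idx.
plan = con.execute("""EXPLAIN QUERY PLAN
    SELECT id FROM transactions INDEXED BY transactions_block_id_idx
    WHERE block_id < 10000""").fetchall()
print(plan[0][-1])
```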
    

Step 8: Profile Resource Utilization

Identify bottlenecks using OS-level tools:

  • I/O Wait: High sys time indicates disk contention. Ensure the database is on an SSD and adequately cached.
  • CPU Utilization: High user time suggests Bloom filter overhead.

Final Recommendation: Hybrid Approach

Combine CROSS JOIN syntax with updated indexes for robustness:

CREATE INDEX transactions_block_id_id_idx ON transactions(block_id, id);
CREATE INDEX vout_tx_id_addr_idx ON vout(tx_id, addr_id);

SELECT COUNT(DISTINCT addr_id)
FROM transactions t
CROSS JOIN vout ON t.id = vout.tx_id
WHERE t.block_id < 10000;

This guarantees transactions is filtered first, uses optimal indexes, and avoids Bloom filter inefficiencies.

By systematically addressing statistics, join order, indexing, and planner heuristics, the performance discrepancy caused by Bloom filter usage is resolved, achieving sub-100ms query execution.
