Bloom Filter Usage Leads to Suboptimal Query Performance in SQLite Joins
Understanding the Performance Impact of Bloom Filters in Join Order Selection
Scenario: Join Order Selection with Bloom Filter vs. Indexed Filtering
The core issue revolves around a SQL query joining two large tables (vout and transactions) with a WHERE filter on transactions.block_id. The query planner selects a suboptimal execution plan involving a Bloom filter when joining the tables in a specific order, resulting in a 21-second execution time. Rewriting the query to force a different join order via CROSS JOIN reduces execution time to 72 milliseconds. This discrepancy arises from how the query planner prioritizes index usage and filtering conditions. Key elements include:
-
Table Schema & Indexes
transactionshas an index onblock_id(transactions_block_id_idx)vouthas two indexes:tx_id(vout_tx_id_idx) andaddr_id+tx_id(vout_addr_tx_idx)- Foreign key relationships:
vout.tx_idreferencestransactions.id
-
Query Variations
- Original query:
SELECT COUNT(DISTINCT addr_id) FROM vout JOIN transactions t ON t.id = vout.tx_id WHERE t.block_id < 10000; - Optimized query using
CROSS JOIN:SELECT COUNT(DISTINCT addr_id) FROM transactions t CROSS JOIN vout ON t.id = vout.tx_id WHERE t.block_id < 10000;
- Original query:
-
Execution Plan Differences
- Original Plan:
- Scans
voutusingvout_addr_tx_idx(covering index) - Applies Bloom filter on
t.id - Searches
transactionsvia primary key
- Scans
- Optimized Plan:
- Uses
transactions_block_id_idxto filter rows early - Joins with
voutviavout_tx_id_idx
- Uses
- Original Plan:
-
Performance Metrics
- Original query: 18.2 seconds (real time)
- Optimized query: 0.04 seconds (real time)
Root Causes of Suboptimal Bloom Filter Usage
1. Index Selectivity and Statistics Misleading the Query Planner
The sqlite_stat1 data provided to the query planner indicates:
transactions_block_id_idxhas 25,342,624 rows with 20 distinct values.vout_tx_id_idxhas 84,748,018 rows with 4 distinct values.vout_addr_tx_idxhas 93,370,966 rows with 2 distinct values.
These statistics suggest that vout_addr_tx_idx has higher selectivity for addr_id (the target of COUNT(DISTINCT)), prompting the planner to prioritize it. However, this index does not directly support the t.block_id < 10000 filter. The Bloom filter is deployed to narrow down t.id values during the join, but this approach fails to leverage the transactions_block_id_idx for early filtering.
2. Bloom Filter Overhead in Nested Loop Joins
Bloom filters optimize joins by probabilistically excluding non-matching rows early. However, their effectiveness depends on:
- Filter placement: Applying a Bloom filter on the inner loop (
transactionstable) forces repeated checks against the filter, increasing CPU overhead. - Index alignment: The Bloom filter on
t.idrequires probing the primary key index oftransactions, which is less efficient than usingtransactions_block_id_idxfor direct range scans.
3. Join Order Decision Heuristics
SQLite’s query planner defaults to scanning the smaller or more selective table first. The vout_addr_tx_idx’s statistics mislead the planner into believing that scanning vout first is optimal. In reality, filtering transactions using block_id first drastically reduces the working set:
- Only rows where
block_id < 10000are processed (≈0.04% of 25M rows). - This reduces the number of
tx_idvalues joined withvout, minimizing index lookups.
4. Missing or Outdated Statistics
The sqlite_stat1 entries lack correlation data between transactions.block_id and vout.tx_id. Without multi-column statistics, the planner cannot estimate the combined selectivity of block_id < 10000 and tx_id matches, leading to incorrect cost assumptions.
Resolving Bloom Filter-Induced Performance Degradation
Step 1: Validate and Update Statistics
-
Re-analyze Tables:
ANALYZE;This regenerates
sqlite_stat1with fresh data. Ensure theANALYZEcommand accounts for recent data distribution changes. -
Manual Statistics Adjustment:
If automatic analysis is insufficient, manually updatesqlite_stat1to reflect the correlation betweenblock_idandidintransactions:UPDATE sqlite_stat1 SET stat = '25342624 20 10000' -- Add third value for block_id selectivity WHERE idx = 'transactions_block_id_idx';This informs the planner that
block_id < 10000selects ≈10,000 rows.
Step 2: Force Join Order with CROSS JOIN
Rewrite the query to enforce transactions as the driving table:
SELECT COUNT(DISTINCT addr_id)
FROM transactions t
CROSS JOIN vout ON t.id = vout.tx_id
WHERE t.block_id < 10000;
This syntax hints the planner to process transactions first, leveraging transactions_block_id_idx for early filtering.
Step 3: Disable Bloom Filters Temporarily
Use the SQLite CLI’s .testctrl command to disable Bloom filters for testing:
.testctrl optimizations 0x80000 -- Disable Bloom filters
Run the original query to compare performance. If Bloom filters were masking a worse plan, this reveals the true cost of the chosen index.
Step 4: Optimize Indexing Strategy
-
Add a Covering Index on
transactions:CREATE INDEX transactions_block_id_id_idx ON transactions(block_id, id);This allows the planner to satisfy both the
block_idfilter andidjoin condition using a single index. -
Modify
voutIndexes for Better Joins:
Replacevout_addr_tx_idxwith an index that includestx_idas the leading column:DROP INDEX vout_addr_tx_idx; CREATE INDEX vout_tx_id_addr_idx ON vout(tx_id, addr_id);This aligns with the join condition
t.id = vout.tx_id, enabling index-only scans foraddr_id.
Step 5: Use Subqueries or CTEs to Isolate Filtering
Break down the query to isolate the block_id filter:
WITH filtered_transactions AS (
SELECT id FROM transactions WHERE block_id < 10000
)
SELECT COUNT(DISTINCT addr_id)
FROM vout
WHERE tx_id IN (SELECT id FROM filtered_transactions);
This structure encourages the planner to process the transactions filter first.
Step 6: Monitor and Tune Query Planner Heuristics
Adjust SQLite’s cost model parameters to favor indexed range scans over full scans:
PRAGMA optimizer_control = 'cost_class ower=100';
Increase the ower (overall weight of rowid accesses) to make primary key lookups more expensive relative to index scans.
Step 7: Evaluate Query Plan Stability
Use EXPLAIN QUERY PLAN to verify the optimized strategy persists across database updates. If the plan regresses, consider:
- SQLite Version Upgrades: Later versions may include improved heuristics.
- Query-Specific Optimizer Hints: Use
INDEXED BYto enforce index usage:SELECT COUNT(DISTINCT addr_id) FROM transactions t INDEXED BY transactions_block_id_idx JOIN vout ON t.id = vout.tx_id WHERE t.block_id < 10000;
Step 8: Profile Resource Utilization
Identify bottlenecks using OS-level tools:
- I/O Wait: High
systime indicates disk contention. Ensure the database is on an SSD and adequately cached. - CPU Utilization: High
usertime suggests Bloom filter overhead.
Final Recommendation: Hybrid Approach
Combine CROSS JOIN syntax with updated indexes for robustness:
CREATE INDEX transactions_block_id_id_idx ON transactions(block_id, id);
CREATE INDEX vout_tx_id_addr_idx ON vout(tx_id, addr_id);
SELECT COUNT(DISTINCT addr_id)
FROM transactions t
CROSS JOIN vout ON t.id = vout.tx_id
WHERE t.block_id < 10000;
This guarantees transactions is filtered first, uses optimal indexes, and avoids Bloom filter inefficiencies.
By systematically addressing statistics, join order, indexing, and planner heuristics, the performance discrepancy caused by Bloom filter usage is resolved, achieving sub-100ms query execution.