Bloom Filter Usage Leads to Suboptimal Query Performance in SQLite Joins
Understanding the Performance Impact of Bloom Filters in Join Order Selection
Scenario: Join Order Selection with Bloom Filter vs. Indexed Filtering
The core issue revolves around a SQL query joining two large tables (`vout` and `transactions`) with a `WHERE` filter on `transactions.block_id`. The query planner selects a suboptimal execution plan involving a Bloom filter when joining the tables in a particular order, resulting in an execution time of roughly 18 seconds. Rewriting the query to force a different join order via `CROSS JOIN` reduces execution time to about 40 milliseconds. This discrepancy arises from how the query planner weighs index usage against filtering conditions. Key elements include:
Table Schema & Indexes
- `transactions` has an index on `block_id` (`transactions_block_id_idx`)
- `vout` has two indexes: `tx_id` (`vout_tx_id_idx`) and `addr_id+tx_id` (`vout_addr_tx_idx`)
- Foreign key relationship: `vout.tx_id` references `transactions.id`
Query Variations
- Original query:

```sql
SELECT COUNT(DISTINCT addr_id)
FROM vout
JOIN transactions t ON t.id = vout.tx_id
WHERE t.block_id < 10000;
```

- Optimized query using `CROSS JOIN`:

```sql
SELECT COUNT(DISTINCT addr_id)
FROM transactions t
CROSS JOIN vout ON t.id = vout.tx_id
WHERE t.block_id < 10000;
```
Execution Plan Differences
- Original plan:
  - Scans `vout` using `vout_addr_tx_idx` (a covering index)
  - Applies a Bloom filter on `t.id`
  - Searches `transactions` via its primary key
- Optimized plan:
  - Uses `transactions_block_id_idx` to filter rows early
  - Joins with `vout` via `vout_tx_id_idx`
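For reference, `EXPLAIN QUERY PLAN` output for the two variants looks roughly like the following sketch; the exact wording varies across SQLite versions, so treat the node text as illustrative:

```sql
EXPLAIN QUERY PLAN
SELECT COUNT(DISTINCT addr_id)
FROM vout JOIN transactions t ON t.id = vout.tx_id
WHERE t.block_id < 10000;
-- Approximate output for the original (Bloom filter) plan:
--   SCAN vout USING COVERING INDEX vout_addr_tx_idx
--   BLOOM FILTER ON t (id=?)
--   SEARCH t USING INTEGER PRIMARY KEY (rowid=?)
--
-- Approximate output after the CROSS JOIN rewrite:
--   SEARCH t USING COVERING INDEX transactions_block_id_idx (block_id<?)
--   SEARCH vout USING INDEX vout_tx_id_idx (tx_id=?)
```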
Performance Metrics
- Original query: 18.2 seconds (real time)
- Optimized query: 0.04 seconds (real time)
Root Causes of Suboptimal Bloom Filter Usage
1. Index Selectivity and Statistics Misleading the Query Planner
The `sqlite_stat1` data provided to the query planner indicates:
- `transactions_block_id_idx`: 25,342,624 rows, averaging 20 rows per distinct `block_id` value.
- `vout_tx_id_idx`: 84,748,018 rows, averaging 4 rows per distinct `tx_id` value.
- `vout_addr_tx_idx`: 93,370,966 rows, averaging 2 rows per distinct key prefix.

(The integers after the row count in a `sqlite_stat1` entry are average rows per distinct key prefix, not counts of distinct values; smaller means more selective.)
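To see these entries directly, the statistics table can be queried like any other; a minimal check, assuming the table names above:

```sql
-- Inspect the statistics the planner will use. The stat column holds the
-- total row count followed by average rows per distinct key prefix.
SELECT tbl, idx, stat
FROM sqlite_stat1
WHERE tbl IN ('transactions', 'vout');
```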
These statistics make `vout_addr_tx_idx` look highly selective, and since it contains both `addr_id` (the target of `COUNT(DISTINCT)`) and `tx_id`, the planner prioritizes it as a covering scan. However, this index does nothing to support the `t.block_id < 10000` filter. The Bloom filter is deployed to narrow down `t.id` values during the join, but this approach fails to leverage `transactions_block_id_idx` for early filtering.
2. Bloom Filter Overhead in Nested Loop Joins
Bloom filters optimize joins by probabilistically excluding non-matching rows before the inner-loop lookup. However, their effectiveness depends on:
- Filter placement: with `vout` as the outer table, every scanned `vout` row (tens of millions here) is checked against the filter guarding the inner `transactions` lookup, adding CPU overhead even when the check successfully skips a probe.
- Index alignment: rows that pass the Bloom filter must still probe the primary key index of `transactions`, which is far less efficient than a direct range scan of `transactions_block_id_idx`.
3. Join Order Decision Heuristics
SQLite’s query planner prefers to place the smaller or more selective table in the outer loop. The statistics for `vout_addr_tx_idx` mislead the planner into believing that scanning `vout` first is optimal. In reality, filtering `transactions` on `block_id` first drastically reduces the working set:
- Only rows where `block_id < 10000` are processed (≈10,000 rows, about 0.04% of the 25M-row table).
- This shrinks the set of `tx_id` values joined against `vout`, minimizing index lookups.

A quick sanity check of that selectivity estimate appears below.
4. Missing or Outdated Statistics
The `sqlite_stat1` entries carry no correlation data between `transactions.block_id` and `vout.tx_id`. Without multi-column or sampled statistics, the planner cannot estimate the combined selectivity of `block_id < 10000` and the `tx_id` join, leading to incorrect cost assumptions.
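Histogram-style samples are the usual remedy for poor range estimates. Whether a given build supports them can be checked from its compile options; a quick probe (option names appear without their `SQLITE_` prefix in the output):

```sql
-- STAT4 stores sampled key distributions in sqlite_stat4, letting the planner
-- estimate range predicates such as block_id < 10000 realistically.
PRAGMA compile_options;   -- look for ENABLE_STAT4 among the result rows
ANALYZE;                  -- with STAT4 enabled, this also populates sqlite_stat4
```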
Resolving Bloom Filter-Induced Performance Degradation
Step 1: Validate and Update Statistics
Re-analyze tables:

```sql
ANALYZE;
```

This regenerates `sqlite_stat1` with fresh data, ensuring the planner’s estimates reflect recent changes in data distribution.

Manual statistics adjustment: if automatic analysis is insufficient, `sqlite_stat1` can be edited by hand to bias the planner. For a single-column index, the `stat` value holds two integers: the total row count and the average number of rows per distinct key. Lowering the second integer tells the planner that `block_id` is more selective than measured:

```sql
UPDATE sqlite_stat1
SET stat = '25342624 2'  -- claim ~2 rows per distinct block_id instead of 20
WHERE idx = 'transactions_block_id_idx';
```

Note that plain `sqlite_stat1` carries no range-selectivity information, so a predicate like `block_id < 10000` still receives a built-in default estimate; genuinely accurate range estimates require the STAT4 samples discussed above.
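One caveat: a connection loads `sqlite_stat1` when the schema is parsed, so an already-open connection may not see hand-edited statistics. A minimal sketch of forcing a reload (closing and reopening the database also works):

```sql
-- Re-read the statistics tables without closing the connection.
ANALYZE sqlite_master;
```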
Step 2: Force Join Order with CROSS JOIN
Rewrite the query so that `transactions` is the driving table:

```sql
SELECT COUNT(DISTINCT addr_id)
FROM transactions t
CROSS JOIN vout ON t.id = vout.tx_id
WHERE t.block_id < 10000;
```

Unlike most join syntax, `CROSS JOIN` is more than a hint in SQLite: the planner never reorders tables across a `CROSS JOIN`, so `transactions` is guaranteed to be the outer loop and `transactions_block_id_idx` can be used for early filtering.
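It is worth confirming that the rewrite actually changed the plan. In the sqlite3 shell, `.eqp on` prints the query plan before each statement’s results:

```
.eqp on
SELECT COUNT(DISTINCT addr_id)
FROM transactions t
CROSS JOIN vout ON t.id = vout.tx_id
WHERE t.block_id < 10000;
```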
Step 3: Disable Bloom Filters Temporarily
Use the SQLite CLI’s `.testctrl` command to disable Bloom filters for testing:

```
.testctrl optimizations 0x80000
```

The `0x80000` bit corresponds to the Bloom-filter optimization in recent builds; `.testctrl` is an unsupported testing interface, and the bitmask may change between versions. Run the original query to compare performance. If the Bloom filter was masking a worse plan, this reveals the true cost of the chosen index.
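A minimal A/B timing workflow in the shell, assuming the bit assignment above; the argument is a mask of *disabled* optimizations, so `0x0` turns everything back on:

```
.timer on
.testctrl optimizations 0x80000
SELECT COUNT(DISTINCT addr_id) FROM vout
JOIN transactions t ON t.id = vout.tx_id WHERE t.block_id < 10000;
.testctrl optimizations 0x0
```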
Step 4: Optimize Indexing Strategy
Add a covering index on `transactions`:

```sql
CREATE INDEX transactions_block_id_id_idx ON transactions(block_id, id);
```

This lets the planner satisfy both the `block_id` filter and the `id` join condition from a single index. (If `id` is the table’s `INTEGER PRIMARY KEY`, i.e. an alias for the rowid, `transactions_block_id_idx` already covers it implicitly, because every index entry carries the rowid.)

Modify `vout` indexes for better joins: replace `vout_addr_tx_idx` with an index whose leading column is `tx_id`:

```sql
DROP INDEX vout_addr_tx_idx;
CREATE INDEX vout_tx_id_addr_idx ON vout(tx_id, addr_id);
```

This aligns with the join condition `t.id = vout.tx_id` while still containing `addr_id`, enabling index-only lookups on the `vout` side of the join.
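With both indexes in place, the plan for the original, un-hinted query should collapse to two index searches. A hoped-for shape, with version-dependent wording:

```sql
EXPLAIN QUERY PLAN
SELECT COUNT(DISTINCT addr_id)
FROM vout JOIN transactions t ON t.id = vout.tx_id
WHERE t.block_id < 10000;
-- Approximate expected output:
--   SEARCH t USING COVERING INDEX transactions_block_id_id_idx (block_id<?)
--   SEARCH vout USING COVERING INDEX vout_tx_id_addr_idx (tx_id=?)
```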
Step 5: Use Subqueries or CTEs to Isolate Filtering
Break the query apart to isolate the `block_id` filter:

```sql
WITH filtered_transactions AS (
  SELECT id FROM transactions WHERE block_id < 10000
)
SELECT COUNT(DISTINCT addr_id)
FROM vout
WHERE tx_id IN (SELECT id FROM filtered_transactions);
```

This structure encourages the planner to evaluate the `transactions` filter first.
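On SQLite 3.35 and later, the CTE can also be forced to materialize, which keeps the planner from flattening it back into the join; a variant under that assumption:

```sql
WITH filtered_transactions AS MATERIALIZED (
  SELECT id FROM transactions WHERE block_id < 10000
)
SELECT COUNT(DISTINCT addr_id)
FROM vout
WHERE tx_id IN (SELECT id FROM filtered_transactions);
```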
Step 6: Monitor and Tune Query Planner Heuristics
SQLite does not expose a pragma for adjusting its cost-model constants, so planner tuning happens per query. The built-in `likelihood()` function (and its shorthand `unlikely()`) annotates a predicate with an assumed probability, nudging the planner toward plans that apply selective filters early:

```sql
SELECT COUNT(DISTINCT addr_id)
FROM vout
JOIN transactions t ON t.id = vout.tx_id
WHERE likelihood(t.block_id < 10000, 0.0004);
```

Here `0.0004` asserts that roughly 0.04% of `transactions` rows satisfy the predicate, matching the measured selectivity and making the `transactions`-first plan look appropriately cheap.
Step 7: Evaluate Query Plan Stability
Use `EXPLAIN QUERY PLAN` to verify the optimized strategy persists across database updates. If the plan regresses, consider:
- SQLite version upgrades: later releases may include improved heuristics (the Bloom-filter optimization itself only appeared in 3.38.0).
- Query-specific optimizer hints: use `INDEXED BY` to enforce index usage:

```sql
SELECT COUNT(DISTINCT addr_id)
FROM transactions t INDEXED BY transactions_block_id_idx
JOIN vout ON t.id = vout.tx_id
WHERE t.block_id < 10000;
```
Step 8: Profile Resource Utilization
Identify bottlenecks using OS-level tools:
- I/O wait: high `sys` time (or a large gap between real and CPU time) indicates disk contention. Ensure the database is on an SSD and the page cache is adequately sized.
- CPU utilization: high `user` time suggests CPU-bound work, such as Bloom-filter checks across tens of millions of rows.
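The sqlite3 shell’s `.timer on` reports the same real/user/sys split without leaving the CLI; illustratively (timings elided):

```
.timer on
SELECT COUNT(DISTINCT addr_id) FROM vout
JOIN transactions t ON t.id = vout.tx_id WHERE t.block_id < 10000;
-- Run Time: real ... user ... sys ...
```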
Final Recommendation: Hybrid Approach
Combine the `CROSS JOIN` rewrite with the updated indexes for robustness:

```sql
CREATE INDEX transactions_block_id_id_idx ON transactions(block_id, id);
CREATE INDEX vout_tx_id_addr_idx ON vout(tx_id, addr_id);

SELECT COUNT(DISTINCT addr_id)
FROM transactions t
CROSS JOIN vout ON t.id = vout.tx_id
WHERE t.block_id < 10000;
```

This guarantees that `transactions` is filtered first, uses well-aligned indexes on both sides of the join, and avoids the Bloom-filter inefficiencies entirely.
By systematically addressing statistics, join order, indexing, and planner heuristics, the performance discrepancy caused by Bloom filter usage is resolved, achieving sub-100ms query execution.