Bloom Filter Optimization Discrepancies in SQLite FTS5 Virtual Table Joins
Bloom Filter Utilization Differences Between JOIN Types on FTS5 Virtual Tables
The core issue revolves around inconsistent query optimization behavior when joining subsets of an FTS5 fts5vocab
virtual table. Specifically, SQLite’s query planner employs a bloom filter to optimize LEFT JOIN
operations but omits it for equivalent INNER JOIN
queries. This discrepancy leads to significant performance degradation in the latter case, especially when joining large datasets.
Technical Context of FTS5 Virtual Tables
The fts5vocab
virtual table type provides access to vocabulary statistics for an FTS5 full-text index. Its instance
variant exposes per-document term frequencies, with rows structured as (term, col, doc, cnt)
. Critically, this table always returns rows ordered by term
due to its internal implementation. SQLite’s optimizer leverages the orderByConsumed
flag reported by virtual tables to determine whether sorting can be skipped. When joining on columns other than term
(e.g., doc
), the planner must decide whether to reorder inputs or apply filtering optimizations like bloom filters.
Observed Query Plan Behavior
Consider two subqueries filtering on distinct term
values and joining on doc
:
-- LEFT JOIN uses bloom filter
SELECT *
FROM (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='myths') res1
LEFT JOIN (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='brain') res2
ON res1.doc = res2.doc;
-- INNER JOIN performs full scans
SELECT *
FROM (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='myths') res1
JOIN (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='brain') res2
ON res1.doc = res2.doc;
The LEFT JOIN
variant generates a plan with MATERIALIZE
, SCAN
, and BLOOM FILTER
steps, efficiently pruning the search space. The INNER JOIN
skips the bloom filter, resulting in two full SCAN
operations. This forces a Cartesian product scan of both subquery results, causing exponential runtime growth with dataset size.
Impact of Missing Bloom Filters
Bloom filters reduce I/O and computation by probabilistically excluding non-matching rows early. Their absence in INNER JOIN
forces SQLite to compare all doc
values between subqueries, even when most pairs are irrelevant. This is particularly detrimental for FTS5 virtual tables, which often handle large text corpora with sparse term distributions.
Underlying Factors Affecting Bloom Filter Application in Virtual Table Joins
Three primary factors contribute to this behavior:
1. Join Type Semantics and Planner Heuristics
SQLite’s query planner treats LEFT JOIN
and INNER JOIN
differently due to their semantic guarantees:
LEFT JOIN
preserves all rows from the left subquery (res1
), making it advantageous to pre-filter the right subquery (res2
). The bloom filter is applied tores2.doc
, allowing rapid elimination of non-matching rows.INNER JOIN
requires matching rows from both subqueries. The planner assumes neither side can be prioritized for filtering, leading it to forgo bloom filters and default to nested loops.
This heuristic works well for conventional tables with indexed join columns but fails for virtual tables lacking native indexing on doc
.
2. Virtual Table Indexing Limitations
FTS5 virtual tables expose pseudo-indexes tied to their internal storage. The VIRTUAL TABLE INDEX 1:
notation indicates the use of the term
-ordered index. However, no native index exists for doc
, forcing the planner to handle joins on this column as unoptimized scans.
When a bloom filter is applied (as in LEFT JOIN
), SQLite injects a dynamic filter on res2.doc
based on res1.doc
values. For INNER JOIN
, the planner incorrectly assumes that both subqueries must be scanned in full, ignoring the opportunity to propagate filters between them.
3. orderByConsumed Mismatch
The orderByConsumed
property reported by the FTS5 virtual table indicates that rows are emitted in term
order. When joining on doc
, this ordering is irrelevant, and SQLite cannot leverage it to optimize the join. The planner falls back to generic strategies, which inconsistently apply bloom filters depending on join type.
For conventional tables, the presence of an index on doc
would override this issue. However, virtual tables cannot declare secondary indexes, leaving the planner with incomplete statistics.
Optimizing Query Plans for FTS5 Virtual Table Joins in SQLite
To address the performance gap, apply the following strategies:
1. Force Materialization with Subquery Isolation
Materializing subqueries can provide the planner with better statistics:
SELECT *
FROM (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='myths') res1
JOIN (SELECT doc FROM FtsIndexInstance WHERE col='digest' AND term='brain') res2
ON res1.doc = res2.doc;
By selecting only doc
from res2
, the subquery’s materialized size is reduced, increasing the likelihood of bloom filter usage.
2. Use Explicit Index Hints for Virtual Tables
While FTS5 virtual tables don’t support traditional indexes, their pseudo-indexes can be hinted using MATERIALIZE
or WITH RECURSIVE
CTEs:
WITH res1 AS MATERIALIZED (
SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='myths'
), res2 AS MATERIALIZED (
SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='brain'
)
SELECT * FROM res1 JOIN res2 ON res1.doc = res2.doc;
The MATERIALIZED
hint forces SQLite to evaluate subqueries upfront, allowing the bloom filter to apply during the join phase.
3. Rewrite INNER JOIN as Filtered LEFT JOIN
If query semantics permit, rewrite the INNER JOIN
to a LEFT JOIN
with a WHERE
clause:
SELECT *
FROM (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='myths') res1
LEFT JOIN (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='brain') res2
ON res1.doc = res2.doc
WHERE res2.doc IS NOT NULL;
This forces bloom filter usage while achieving the same result as an INNER JOIN
.
4. Leverage Composite Virtual Table Indexes
FTS5 virtual tables allow compound pseudo-indexes on (term, col, doc)
. Restructure queries to align with this order:
SELECT *
FROM FtsIndexInstance res1
JOIN FtsIndexInstance res2
ON res1.doc = res2.doc AND res1.term = 'myths' AND res2.term = 'brain'
WHERE res1.col = 'digest' AND res2.col = 'digest';
By incorporating term
into the join condition, the planner can leverage the virtual table’s native ordering, reducing the need for bloom filters.
5. Upgrade to SQLite 3.44+ for Enhanced Bloom Filter Logic
Later SQLite versions (3.44+) include optimizations for bloom filter applicability in INNER JOIN
scenarios. Test the query under these versions to see if the planner behavior has improved.
6. Implement Application-Side Bloom Filters
For extreme cases, precompute doc
values from one subquery and inject them into the other:
SELECT *
FROM (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='myths') res1
JOIN FtsIndexInstance res2
ON res2.doc = res1.doc
WHERE res2.term = 'brain' AND res2.col = 'digest';
This mimics a bloom filter by ensuring res2
is probed only for doc
values present in res1
.
Final Considerations
Always validate changes using EXPLAIN QUERY PLAN
and benchmark against real-world data distributions. Virtual table joins require careful balancing between SQLite’s optimizer capabilities and the inherent limitations of their pseudo-indexing structures.