Bloom Filter Optimization Discrepancies in SQLite FTS5 Virtual Table Joins

Bloom Filter Utilization Differences Between JOIN Types on FTS5 Virtual Tables

The core issue revolves around inconsistent query optimization behavior when joining subsets of an FTS5 fts5vocab virtual table. Specifically, SQLite’s query planner employs a bloom filter to optimize LEFT JOIN operations but omits it for equivalent INNER JOIN queries. This discrepancy leads to significant performance degradation in the latter case, especially when joining large datasets.

Technical Context of FTS5 Virtual Tables

The fts5vocab virtual table type provides access to vocabulary statistics for an FTS5 full-text index. Its instance variant exposes per-document term frequencies, with rows structured as (term, col, doc, cnt). Critically, this table always returns rows ordered by term due to its internal implementation. SQLite’s optimizer leverages the orderByConsumed flag reported by virtual tables to determine whether sorting can be skipped. When joining on columns other than term (e.g., doc), the planner must decide whether to reorder inputs or apply filtering optimizations like bloom filters.

Observed Query Plan Behavior

Consider two subqueries filtering on distinct term values and joining on doc:

-- LEFT JOIN uses bloom filter
SELECT * 
FROM (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='myths') res1  
LEFT JOIN (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='brain') res2  
ON res1.doc = res2.doc;

-- INNER JOIN performs full scans
SELECT * 
FROM (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='myths') res1  
JOIN (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='brain') res2  
ON res1.doc = res2.doc;

The LEFT JOIN variant generates a plan with MATERIALIZE, SCAN, and BLOOM FILTER steps, efficiently pruning the search space. The INNER JOIN skips the bloom filter, resulting in two full SCAN operations. This forces a Cartesian product scan of both subquery results, causing exponential runtime growth with dataset size.

Impact of Missing Bloom Filters

Bloom filters reduce I/O and computation by probabilistically excluding non-matching rows early. Their absence in INNER JOIN forces SQLite to compare all doc values between subqueries, even when most pairs are irrelevant. This is particularly detrimental for FTS5 virtual tables, which often handle large text corpora with sparse term distributions.

Underlying Factors Affecting Bloom Filter Application in Virtual Table Joins

Three primary factors contribute to this behavior:

1. Join Type Semantics and Planner Heuristics

SQLite’s query planner treats LEFT JOIN and INNER JOIN differently due to their semantic guarantees:

  • LEFT JOIN preserves all rows from the left subquery (res1), making it advantageous to pre-filter the right subquery (res2). The bloom filter is applied to res2.doc, allowing rapid elimination of non-matching rows.
  • INNER JOIN requires matching rows from both subqueries. The planner assumes neither side can be prioritized for filtering, leading it to forgo bloom filters and default to nested loops.

This heuristic works well for conventional tables with indexed join columns but fails for virtual tables lacking native indexing on doc.

2. Virtual Table Indexing Limitations

FTS5 virtual tables expose pseudo-indexes tied to their internal storage. The VIRTUAL TABLE INDEX 1: notation indicates the use of the term-ordered index. However, no native index exists for doc, forcing the planner to handle joins on this column as unoptimized scans.

When a bloom filter is applied (as in LEFT JOIN), SQLite injects a dynamic filter on res2.doc based on res1.doc values. For INNER JOIN, the planner incorrectly assumes that both subqueries must be scanned in full, ignoring the opportunity to propagate filters between them.

3. orderByConsumed Mismatch

The orderByConsumed property reported by the FTS5 virtual table indicates that rows are emitted in term order. When joining on doc, this ordering is irrelevant, and SQLite cannot leverage it to optimize the join. The planner falls back to generic strategies, which inconsistently apply bloom filters depending on join type.

For conventional tables, the presence of an index on doc would override this issue. However, virtual tables cannot declare secondary indexes, leaving the planner with incomplete statistics.

Optimizing Query Plans for FTS5 Virtual Table Joins in SQLite

To address the performance gap, apply the following strategies:

1. Force Materialization with Subquery Isolation

Materializing subqueries can provide the planner with better statistics:

SELECT *  
FROM (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='myths') res1  
JOIN (SELECT doc FROM FtsIndexInstance WHERE col='digest' AND term='brain') res2  
ON res1.doc = res2.doc;

By selecting only doc from res2, the subquery’s materialized size is reduced, increasing the likelihood of bloom filter usage.

2. Use Explicit Index Hints for Virtual Tables

While FTS5 virtual tables don’t support traditional indexes, their pseudo-indexes can be hinted using MATERIALIZE or WITH RECURSIVE CTEs:

WITH res1 AS MATERIALIZED (  
  SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='myths'  
), res2 AS MATERIALIZED (  
  SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='brain'  
)  
SELECT * FROM res1 JOIN res2 ON res1.doc = res2.doc;

The MATERIALIZED hint forces SQLite to evaluate subqueries upfront, allowing the bloom filter to apply during the join phase.

3. Rewrite INNER JOIN as Filtered LEFT JOIN

If query semantics permit, rewrite the INNER JOIN to a LEFT JOIN with a WHERE clause:

SELECT *  
FROM (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='myths') res1  
LEFT JOIN (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='brain') res2  
ON res1.doc = res2.doc  
WHERE res2.doc IS NOT NULL;

This forces bloom filter usage while achieving the same result as an INNER JOIN.

4. Leverage Composite Virtual Table Indexes

FTS5 virtual tables allow compound pseudo-indexes on (term, col, doc). Restructure queries to align with this order:

SELECT *  
FROM FtsIndexInstance res1  
JOIN FtsIndexInstance res2  
ON res1.doc = res2.doc AND res1.term = 'myths' AND res2.term = 'brain'  
WHERE res1.col = 'digest' AND res2.col = 'digest';

By incorporating term into the join condition, the planner can leverage the virtual table’s native ordering, reducing the need for bloom filters.

5. Upgrade to SQLite 3.44+ for Enhanced Bloom Filter Logic

Later SQLite versions (3.44+) include optimizations for bloom filter applicability in INNER JOIN scenarios. Test the query under these versions to see if the planner behavior has improved.

6. Implement Application-Side Bloom Filters

For extreme cases, precompute doc values from one subquery and inject them into the other:

SELECT *  
FROM (SELECT * FROM FtsIndexInstance WHERE col='digest' AND term='myths') res1  
JOIN FtsIndexInstance res2  
ON res2.doc = res1.doc  
WHERE res2.term = 'brain' AND res2.col = 'digest';

This mimics a bloom filter by ensuring res2 is probed only for doc values present in res1.

Final Considerations

Always validate changes using EXPLAIN QUERY PLAN and benchmark against real-world data distributions. Virtual table joins require careful balancing between SQLite’s optimizer capabilities and the inherent limitations of their pseudo-indexing structures.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *