Bloom Filter Optimization Causes Incorrect Results Under RTRIM Collation Comparisons


Collapsing String Comparisons via RTRIM Collation and Bloom Filter Mismatch

When executing queries that combine string comparisons with non-binary collations and Bloom filter optimizations, SQLite may return incorrect results due to fundamental incompatibilities between hashing strategies and collation-aware equality checks. This manifests when comparing values that are considered equivalent under specific collation rules (e.g., RTRIM) but differ in their binary representations. The Bloom filter optimization, designed to accelerate equality checks by pre-filtering candidate rows using hash values, fails to account for collation rules when generating hashes. This results in false negatives during query execution, where valid rows are erroneously excluded from the result set.

The core problem arises from three interconnected factors:

  1. Collation-Aware Comparisons altering equality semantics (e.g., ' ' and '' being treated as equivalent under RTRIM collation),
  2. Bloom Filter Hashing relying on binary representations of values rather than collation-adjusted forms,
  3. Query Planner Decisions to apply Bloom filter optimizations in contexts where collation rules invalidate hash-based prefiltering.

In the provided example, the query SELECT COUNT(*) FROM t0, v0 WHERE t0.c0 COLLATE RTRIM = '' should return 3 due to one matching row in t0 (value ' ') and three distinct rows in the view v0. However, the Bloom filter optimization incorrectly filters out all rows because the hash of ' ' (binary value 0x20) does not match the hash of '' (binary value empty), even though they are considered equal under RTRIM collation. This discrepancy causes SQLite to produce a result of 0 instead of 3.


Hash Function Collision Requirements vs. Collation Rule Semantics

The root cause lies in the Bloom filter optimization’s inability to reconcile collation-defined equality with hash function behavior. Bloom filters operate by hashing values and using those hashes to probabilistically determine whether a value exists in a set. For binary equality checks, this is straightforward: two values must have identical byte sequences to be considered equal. However, collations like RTRIM redefine equality in ways that are not captured by binary hashing.

1. Collation-Specific Equality Redefinition

The RTRIM collation trims trailing whitespace before comparing strings. Under this rule, ' ' (a single space) and '' (empty string) are treated as equal because the trailing space is stripped. However, their binary representations differ, leading to distinct hash values when processed by SQLite’s default hash functions. Bloom filters rely on these hashes to exclude non-matching rows during query execution. When the hash of a candidate value (e.g., ' ') does not match the hash of the target value (''), the Bloom filter incorrectly discards the candidate row, even though the collation rule would deem them equal.

2. Hash Function Design Limitations

SQLite’s Bloom filter implementation uses a hash function optimized for speed and low collision rates. Prior to the fix, this hash function treated strings and blobs as sequences of bytes, generating hashes that reflected their exact binary content. This design works for binary comparisons but fails for collation-aware comparisons where semantically equivalent values have different binary representations. The hash function cannot dynamically adjust to collation rules, leading to mismatches between hash-based prefiltering and collation-based equality checks.

3. Query Planner Over-Application of Bloom Filters

The SQLite query planner may apply Bloom filter optimizations in contexts where collation rules invalidate the hash function’s assumptions. In the example, the join between t0 and v0 triggers the use of a Bloom filter to accelerate the WHERE t0.c0 COLLATE RTRIM = '' clause. However, the planner does not account for the fact that the collation rule redefines equality in a way that is incompatible with the Bloom filter’s hash-based logic. This results in incorrect filtering during query execution.


Resolving Collation vs. Bloom Filter Conflicts Through Targeted Fixes and Workarounds

Step 1: Validate SQLite Version and Optimization Settings

Confirm the SQLite version using sqlite3_version() or SELECT sqlite_version(). Versions prior to the fix (check-in 090304b870419acb) will exhibit the bug. If using a vulnerable version, proceed to disable Bloom filter optimizations temporarily:

PRAGMA compile_options;
-- Look for ENABLE_BLOOM_FILTER
PRAGMA disable_bloom_filter=1; -- If available in your build

Disabling the Bloom filter forces the query planner to use traditional join algorithms, bypassing the hash collision issue. Test the original query after applying this setting. If the result changes to 3, the Bloom filter optimization is confirmed as the culprit.

Step 2: Apply the Official Hash Function Patch

The fix involves modifying how Bloom filters hash strings and blobs when collations are in use. Instead of generating unique hashes for distinct binary values, all strings and blobs are assigned the same hash value. This crude workaround ensures that collation-defined equality checks are not short-circuited by hash mismatches. While this guarantees correctness, it severely degrades Bloom filter performance for string/blob columns, as the filter can no longer exclude non-matching rows efficiently.

To apply the fix:

  1. Upgrade to a SQLite build that includes the check-in 090304b870419acb or later.
  2. Recompile SQLite with the updated source code if using a custom build.

Step 3: Restrict Bloom Filter Usage for Collation-Affected Columns

If upgrading is not feasible, modify the query or schema to prevent the query planner from using Bloom filters on columns with non-binary collations. Techniques include:

  • Explicit Collation Removal: Avoid applying collations in join or filter clauses where possible. For example, pre-normalize data in the table to store values without trailing spaces:
    CREATE TABLE t0 (c0 TEXT COLLATE RTRIM); -- Apply collation at storage level
    INSERT INTO t0 VALUES (TRIM(c0)); -- Ensure stored values are normalized
    
  • Query Rewriting: Use subqueries or WITH clauses to materialize collation-adjusted values before joining:
    SELECT COUNT(*) 
    FROM (SELECT c0 COLLATE RTRIM AS c0_rtrim FROM t0) AS t0_normalized, 
         v0 
    WHERE t0_normalized.c0_rtrim = '';
    

    This forces the collation adjustment to occur before the join, potentially altering the query planner’s optimization choices.

Step 4: Monitor Performance and Apply Indexing Strategies

After applying the hash function patch, Bloom filters become less effective for string/blob columns. Mitigate performance regressions by:

  • Adding Indexes on Collation-Adjusted Columns:

    CREATE INDEX t0_c0_rtrim ON t0 (c0 COLLATE RTRIM);
    

    This allows the query planner to use index-based lookups instead of Bloom filters for collation-aware comparisons.

  • Using Covered Indexes: Include all columns referenced in the query to enable index-only scans:

    CREATE INDEX t0_c0_covered ON t0 (c0 COLLATE RTRIM) INCLUDE (c0);
    

Step 5: Evaluate Alternative Optimizations

If Bloom filter performance remains unsatisfactory, explore alternative optimizations:

  • Disable Bloom Filters Permanently for specific queries using SQLite’s C/C++ interface:
    sqlite3_db_config(db, SQLITE_DBCONFIG_ENABLE_BLOOM_FILTER, 0, 0);
    
  • Use Query Hints to influence the query planner’s choice of join algorithms (requires SQLite 3.32.0+):
    SELECT COUNT(*) FROM t0, v0 WHERE t0.c0 COLLATE RTRIM = '' /*+ NO_BLOOM_FILTER */;
    

Step 6: Await Further Optimizations in Future Releases

The initial fix intentionally degrades Bloom filter utility for strings/blobs to prioritize correctness. Follow SQLite’s changelogs for updates that restore Bloom filter efficiency while respecting collation rules. Future versions may introduce:

  • Collation-Aware Hash Functions: Generate hash values based on collation-adjusted representations of strings.
  • Dynamic Bloom Filter Selection: Enable the query planner to disable Bloom filters automatically when collations invalidate hash-based optimizations.

This comprehensive approach addresses both the immediate symptom (incorrect query results) and the underlying design conflict between collation rules and Bloom filter optimizations. By combining temporary workarounds, schema adjustments, and long-term upgrade planning, developers can maintain correctness while minimizing performance penalties.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *