GLOB Index Acceleration Fails with Escaped Special Characters in SQLite


Understanding GLOB Pattern Escaping and Index Utilization Challenges

Issue Overview: Escaped Special Characters in GLOB Patterns Prevent Index Acceleration

The core issue revolves around SQLite’s inability to utilize indices for GLOB queries when the pattern includes escaped special characters such as *, [, or ]. This limitation becomes apparent when directory paths or filenames contain these characters, requiring escaping via square brackets (e.g., [[] to match a literal [). While escaping ensures correct query results, it disables the query planner’s ability to leverage indices for acceleration, resulting in full table scans (SCAN files) instead of index-optimized searches (SEARCH files USING INDEX).

Example Breakdown
Consider a table files(path TEXT) with an index files_by_path on path.

  • A query for path GLOB 'a*' (all paths starting with "a") uses the index:
    QUERY PLAN
    `--SEARCH files USING COVERING INDEX files_by_path (path>? AND path<?)
    
  • Escaping a character (e.g., path GLOB '[a]*') or querying paths starting with * (e.g., path GLOB '[*]*') results in a full scan:
    QUERY PLAN
    `--SCAN files
    

The problem arises because SQLite’s query planner does not recognize that [a] in a GLOB pattern is semantically equivalent to the unescaped a. This prevents the index from being used to narrow down the search range. In real-world scenarios (e.g., querying directory structures with paths like /folderwith[[]brackets[]]/subdir/*), this leads to inefficient scans over large datasets.

Root Causes: How SQLite’s GLOB Optimization Works (and When It Doesn’t)

  1. LIKE/GLOB Optimization Constraints
    SQLite’s index acceleration for GLOB is derived from its LIKE optimization rules (documented here). These rules allow index usage only for patterns that:

    • Begin with a literal prefix (e.g., 'a*').
    • Do not contain unescaped special characters (*, ?, [, ]) in the literal prefix.

    Escaping a character with [ ] (e.g., '[a]*') breaks the literal prefix recognition, as the optimizer treats [a] as a character set, not a literal a.

  2. Pattern Parsing Limitations
    The query planner does not simplify patterns containing escaped characters. For example:

    • '[a]*' is not simplified to 'a*' despite [a] being a single-character set.
    • Patterns with nested escapes (e.g., '[[]*' to match paths starting with [) are not parsed to extract the literal prefix.
  3. Collation and UTF-8 Handling
    Escaped characters in multi-byte UTF-8 sequences complicate pattern analysis. The optimizer would need to validate that [ ] encloses a single valid UTF-8 codepoint, which is not implemented.

Solutions: Manual Index Range Queries and Pattern Rewriting

1. Rewrite GLOB Queries Using Index-Friendly Range Conditions

Replace GLOB with BETWEEN or inequality conditions to force index usage. This requires deriving upper and lower bounds for the search range.

Example: Querying Paths Starting with '[a]'
Instead of:

SELECT * FROM files WHERE path GLOB '[a]*';

Use:

SELECT * FROM files 
WHERE path BETWEEN '[a' AND '[a' || char(0x10FFFF);
  • BETWEEN defines a range from '[a' (the smallest string starting with [a) to '[a' concatenated with the highest possible UTF-8 character (0x10FFFF), ensuring all paths starting with [a are included.
  • The char(0x10FFFF) function appends the Unicode replacement character, which sorts after all valid characters in SQLite’s default BINARY collation.

Verification with EXPLAIN QUERY PLAN

EXPLAIN QUERY PLAN 
SELECT * FROM files 
WHERE path BETWEEN '[a' AND '[a' || char(0x10FFFF);

Output:

QUERY PLAN
`--SEARCH files USING COVERING INDEX files_by_path (path>? AND path<?)
2. Handling Directory Structures with Escaped Characters

For paths like /folderwith[[]brackets[]]/subdir/*, manually construct the range:

SELECT * FROM files 
WHERE path BETWEEN '/folderwith[[brackets[]]/subdir/' 
  AND '/folderwith[[brackets[]]/subdir/' || char(0x10FFFF);

This captures all paths under /folderwith[[brackets[]]/subdir/ while using the index.

3. Using Inequality Operators for Fine-Grained Control

For cases where BETWEEN is impractical, use > and < with the next character in the collation sequence. For example:

SELECT * FROM files 
WHERE path >= '/folderwith[brackets]/subdir/' 
  AND path < '/folderwith[brackets]/subdir0';
  • The 0 character follows / in ASCII, making this range inclusive of all subdirectories under /folderwith[brackets]/subdir/.
4. Parameter Binding for Dynamic Queries

When generating queries programmatically, bind parameters to avoid SQL injection and simplify range construction. Example in Go:

dir := "/folderwith[[brackets[]]/subdir/"
query := `
  SELECT * FROM files 
  WHERE path BETWEEN ? AND ? || char(0x10FFFF)
`
rows, err := db.Query(query, dir, dir)
5. Edge Cases and Collation-Specific Adjustments
  • Binary vs. Unicode Collations: The char(0x10FFFF) method assumes a BINARY collation. For ICU collations, use x'f48fbfbf' (UTF-8 for U+10FFFF).
  • Trailing Slashes: Ensure directory paths end with / to avoid matching sibling entries (e.g., folderwith[brackets]X).
6. Limitations and Tradeoffs
  • Readability: Manual range queries are less intuitive than GLOB.
  • Maintenance: Changes to collation or SQLite’s sorting logic could break range assumptions.
  • False Positives: The BETWEEN method may include unwanted entries, necessitating a secondary GLOB filter:
    SELECT * FROM files 
    WHERE path BETWEEN ? AND ? || char(0x10FFFF)
      AND path GLOB ?;
    

Final Recommendations and Best Practices

  1. Prefer BETWEEN for Static Patterns
    Use BETWEEN with char(0x10FFFF) for fixed directory structures. This approach is concise and leverages index acceleration.

  2. Combine Range Queries with Secondary Filtering
    Add a GLOB or LIKE clause after the BETWEEN to eliminate false positives without sacrificing index usage:

    SELECT * FROM files 
    WHERE path BETWEEN '/folder/' AND '/folder/' || char(0x10FFFF)
      AND path GLOB '/folder/*';
    
  3. Benchmark and Validate
    Always use EXPLAIN QUERY PLAN to confirm index usage and test queries on large datasets to ensure performance gains.

  4. Advocate for Future Optimizations
    File a feature request with the SQLite team to extend the GLOB optimization to handle escaped characters. Reference community discussions like this thread to highlight the use case.

By adopting these strategies, developers can maintain query performance while correctly escaping special characters in file paths or directory structures.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *