GLOB Index Acceleration Fails with Escaped Special Characters in SQLite
Understanding GLOB Pattern Escaping and Index Utilization Challenges
Issue Overview: Escaped Special Characters in GLOB Patterns Prevent Index Acceleration
The core issue revolves around SQLite’s inability to utilize indices for GLOB
queries when the pattern includes escaped special characters such as *
, [
, or ]
. This limitation becomes apparent when directory paths or filenames contain these characters, requiring escaping via square brackets (e.g., [[]
to match a literal [
). While escaping ensures correct query results, it disables the query planner’s ability to leverage indices for acceleration, resulting in full table scans (SCAN files
) instead of index-optimized searches (SEARCH files USING INDEX
).
Example Breakdown
Consider a table files(path TEXT)
with an index files_by_path
on path
.
- A query for
path GLOB 'a*'
(all paths starting with "a") uses the index:QUERY PLAN `--SEARCH files USING COVERING INDEX files_by_path (path>? AND path<?)
- Escaping a character (e.g.,
path GLOB '[a]*'
) or querying paths starting with*
(e.g.,path GLOB '[*]*'
) results in a full scan:QUERY PLAN `--SCAN files
The problem arises because SQLite’s query planner does not recognize that [a]
in a GLOB
pattern is semantically equivalent to the unescaped a
. This prevents the index from being used to narrow down the search range. In real-world scenarios (e.g., querying directory structures with paths like /folderwith[[]brackets[]]/subdir/*
), this leads to inefficient scans over large datasets.
Root Causes: How SQLite’s GLOB Optimization Works (and When It Doesn’t)
LIKE/GLOB Optimization Constraints
SQLite’s index acceleration forGLOB
is derived from itsLIKE
optimization rules (documented here). These rules allow index usage only for patterns that:- Begin with a literal prefix (e.g.,
'a*'
). - Do not contain unescaped special characters (
*
,?
,[
,]
) in the literal prefix.
Escaping a character with
[ ]
(e.g.,'[a]*'
) breaks the literal prefix recognition, as the optimizer treats[a]
as a character set, not a literala
.- Begin with a literal prefix (e.g.,
Pattern Parsing Limitations
The query planner does not simplify patterns containing escaped characters. For example:'[a]*'
is not simplified to'a*'
despite[a]
being a single-character set.- Patterns with nested escapes (e.g.,
'[[]*'
to match paths starting with[
) are not parsed to extract the literal prefix.
Collation and UTF-8 Handling
Escaped characters in multi-byte UTF-8 sequences complicate pattern analysis. The optimizer would need to validate that[ ]
encloses a single valid UTF-8 codepoint, which is not implemented.
Solutions: Manual Index Range Queries and Pattern Rewriting
1. Rewrite GLOB Queries Using Index-Friendly Range Conditions
Replace GLOB
with BETWEEN
or inequality conditions to force index usage. This requires deriving upper and lower bounds for the search range.
Example: Querying Paths Starting with '[a]'
Instead of:
SELECT * FROM files WHERE path GLOB '[a]*';
Use:
SELECT * FROM files
WHERE path BETWEEN '[a' AND '[a' || char(0x10FFFF);
BETWEEN
defines a range from'[a'
(the smallest string starting with[a
) to'[a'
concatenated with the highest possible UTF-8 character (0x10FFFF
), ensuring all paths starting with[a
are included.- The
char(0x10FFFF)
function appends the Unicode replacement character, which sorts after all valid characters in SQLite’s defaultBINARY
collation.
Verification with EXPLAIN QUERY PLAN
EXPLAIN QUERY PLAN
SELECT * FROM files
WHERE path BETWEEN '[a' AND '[a' || char(0x10FFFF);
Output:
QUERY PLAN
`--SEARCH files USING COVERING INDEX files_by_path (path>? AND path<?)
2. Handling Directory Structures with Escaped Characters
For paths like /folderwith[[]brackets[]]/subdir/*
, manually construct the range:
SELECT * FROM files
WHERE path BETWEEN '/folderwith[[brackets[]]/subdir/'
AND '/folderwith[[brackets[]]/subdir/' || char(0x10FFFF);
This captures all paths under /folderwith[[brackets[]]/subdir/
while using the index.
3. Using Inequality Operators for Fine-Grained Control
For cases where BETWEEN
is impractical, use >
and <
with the next character in the collation sequence. For example:
SELECT * FROM files
WHERE path >= '/folderwith[brackets]/subdir/'
AND path < '/folderwith[brackets]/subdir0';
- The
0
character follows/
in ASCII, making this range inclusive of all subdirectories under/folderwith[brackets]/subdir/
.
4. Parameter Binding for Dynamic Queries
When generating queries programmatically, bind parameters to avoid SQL injection and simplify range construction. Example in Go:
dir := "/folderwith[[brackets[]]/subdir/"
query := `
SELECT * FROM files
WHERE path BETWEEN ? AND ? || char(0x10FFFF)
`
rows, err := db.Query(query, dir, dir)
5. Edge Cases and Collation-Specific Adjustments
- Binary vs. Unicode Collations: The
char(0x10FFFF)
method assumes aBINARY
collation. ForICU
collations, usex'f48fbfbf'
(UTF-8 forU+10FFFF
). - Trailing Slashes: Ensure directory paths end with
/
to avoid matching sibling entries (e.g.,folderwith[brackets]X
).
6. Limitations and Tradeoffs
- Readability: Manual range queries are less intuitive than
GLOB
. - Maintenance: Changes to collation or SQLite’s sorting logic could break range assumptions.
- False Positives: The
BETWEEN
method may include unwanted entries, necessitating a secondaryGLOB
filter:SELECT * FROM files WHERE path BETWEEN ? AND ? || char(0x10FFFF) AND path GLOB ?;
Final Recommendations and Best Practices
Prefer
BETWEEN
for Static Patterns
UseBETWEEN
withchar(0x10FFFF)
for fixed directory structures. This approach is concise and leverages index acceleration.Combine Range Queries with Secondary Filtering
Add aGLOB
orLIKE
clause after theBETWEEN
to eliminate false positives without sacrificing index usage:SELECT * FROM files WHERE path BETWEEN '/folder/' AND '/folder/' || char(0x10FFFF) AND path GLOB '/folder/*';
Benchmark and Validate
Always useEXPLAIN QUERY PLAN
to confirm index usage and test queries on large datasets to ensure performance gains.Advocate for Future Optimizations
File a feature request with the SQLite team to extend theGLOB
optimization to handle escaped characters. Reference community discussions like this thread to highlight the use case.
By adopting these strategies, developers can maintain query performance while correctly escaping special characters in file paths or directory structures.