Efficiently Retrieving SQLite Blob Chunks Overlapping a Byte Range
Understanding Chunk Overlap Queries and Index Utilization
Core Challenge: Selecting Blob Chunks Intersecting a Requested Byte Range
When storing large files as non-overlapping, contiguous chunks in SQLite, a common requirement is to retrieve all chunks that overlap a specific byte range (e.g., for partial file reads or streaming). The challenge arises when the requested range's start and end do not align with existing chunk boundaries. A naive `BETWEEN` clause fails because it only returns chunks whose `offset` column lies within the requested range, ignoring chunks that start before the range but extend into it. For example:
- A chunk starting at offset 1,048,576 (1 MiB) with length 4,096 bytes covers bytes 1,048,576 through 1,052,671, i.e., it ends at offset 1,052,672.
- A request for bytes 1,048,583–1,052,673 must include this chunk, but `offset BETWEEN 1048583 AND 1052673` would miss it because 1,048,576 is less than 1,048,583.
The solution requires a range overlap condition that accounts for both the chunk's start (`offset`) and end (`offset + length(chunk)`). However, implementing this efficiently, so that it leverages the primary key index for fast lookups, is non-trivial.
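The failure mode is easy to reproduce. The following sketch uses Python's built-in `sqlite3` module against the `chunk` table described in this article (the schema shown in the `CREATE TABLE` is assumed, not prescribed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE chunk (
    file_id INTEGER NOT NULL,
    offset  INTEGER NOT NULL,
    chunk   BLOB NOT NULL,
    PRIMARY KEY (file_id, offset))""")
# One 4096-byte chunk starting at 1 MiB: it covers bytes 1048576..1052671.
con.execute("INSERT INTO chunk VALUES (1, 1048576, ?)", (b"\0" * 4096,))

lower, upper = 1048583, 1052673

# Naive BETWEEN on the start offset misses the chunk entirely.
naive = con.execute(
    "SELECT offset FROM chunk WHERE file_id = 1 AND offset BETWEEN ? AND ?",
    (lower, upper)).fetchall()

# The overlap condition on both chunk start and chunk end finds it.
overlap = con.execute(
    "SELECT offset FROM chunk WHERE file_id = 1 "
    "AND offset <= ? AND offset + length(chunk) >= ?",
    (upper, lower)).fetchall()

print(naive)    # []
print(overlap)  # [(1048576,)]
```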
Why Index Scans Fail with Range Overlap Conditions
The primary key `(file_id, offset)` ensures chunks are stored in offset order for each file. Ideally, a query targeting a specific `file_id` and offset range should use this index for a binary search or range scan. However, the overlap condition `offset <= :upper AND (offset + length(chunk)) >= :lower` introduces two problems:
- Non-Indexed Computed Column: The term `offset + length(chunk)` is computed at runtime. SQLite cannot use an index to optimize conditions involving this value, as it is not stored in the index. The optimizer must visit all rows matching `file_id = ? AND offset <= :upper` and then filter those where `offset + length(chunk) >= :lower`.
- Index Selection Heuristics: SQLite's query planner reduces the initial row set using the most selective indexable condition. If the `offset <= :upper` clause significantly narrows the rows (e.g., when `:upper` is small), the automatic index on `(file_id, offset)` keeps the scan cheap. This is not guaranteed, however: for large files with millions of chunks, the `offset <= :upper` range can still include thousands of rows.
Example Query Plan Analysis:
```sql
EXPLAIN QUERY PLAN
SELECT offset FROM chunk
WHERE file_id = 1
  AND (offset + length(chunk)) >= 1048583
  AND offset < 1052673;
```
The output shows:
```
QUERY PLAN
`--SEARCH chunk USING INDEX sqlite_autoindex_chunk_1 (file_id=? AND offset<?)
```
This indicates SQLite uses the primary key index (`sqlite_autoindex_chunk_1`) to find rows where `file_id = 1` and `offset < 1052673`, then applies the `offset + length(chunk) >= 1048583` filter to each of them. For large files, this can still mean scanning many rows.
Optimizing Chunk Retrieval with Subqueries and Index Hints
To force efficient index usage, restructure the query to:
- Find the first chunk overlapping the lower bound using a subquery.
- Find the last chunk overlapping the upper bound using the primary key.
- Retrieve all chunks between these two offsets.
Step 1: Locate the Starting Chunk
The first chunk overlapping the requested range is the one with the largest `offset` less than or equal to the lower bound. This can be found via:

```sql
SELECT MAX(offset) FROM chunk
WHERE file_id = :file_id AND offset <= :lower;
```

This subquery leverages the primary key index for an efficient O(log N) search.
Step 2: Retrieve All Chunks Up to the Upper Bound
Combine the starting offset with the upper bound:

```sql
SELECT offset FROM chunk
WHERE file_id = :file_id
  AND offset >= (SELECT MAX(offset) FROM chunk
                 WHERE file_id = :file_id AND offset <= :lower)
  AND offset < :upper;
```

The `offset >= ...` condition ensures the scan starts at the first overlapping chunk, while `offset < :upper` limits it to chunks starting before the upper bound. The primary key index enables a range scan between these two offsets.
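Put together, a small helper fetches exactly the chunks needed to serve a byte range. This is a sketch; the function name `read_range` is illustrative, not part of any API:

```python
import sqlite3

def read_range(con, file_id, lower, upper):
    """Return (offset, chunk) rows for all chunks overlapping [lower, upper)."""
    return con.execute(
        """SELECT offset, chunk FROM chunk
           WHERE file_id = :file_id
             AND offset >= (SELECT MAX(offset) FROM chunk
                            WHERE file_id = :file_id AND offset <= :lower)
             AND offset < :upper""",
        {"file_id": file_id, "lower": lower, "upper": upper}).fetchall()

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE chunk (
    file_id INTEGER NOT NULL, offset INTEGER NOT NULL,
    chunk BLOB NOT NULL, PRIMARY KEY (file_id, offset))""")
# Four contiguous 4096-byte chunks at offsets 0, 4096, 8192, 12288.
for i in range(4):
    con.execute("INSERT INTO chunk VALUES (1, ?, ?)", (i * 4096, b"x" * 4096))

# Bytes 4100..8199 touch the chunks starting at 4096 and 8192.
rows = read_range(con, 1, 4100, 8200)
print([off for off, _ in rows])  # [4096, 8192]
```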
Query Plan Analysis:
```sql
EXPLAIN QUERY PLAN
SELECT offset FROM chunk
WHERE file_id = 1
  AND offset >= (SELECT MAX(offset) FROM chunk
                 WHERE file_id = 1 AND offset <= 1048583)
  AND offset < 1052673;
```
Output:
```
QUERY PLAN
|--CO-ROUTINE (subquery-1)
|  `--SEARCH chunk USING INDEX sqlite_autoindex_chunk_1 (file_id=? AND offset<?)
`--SEARCH chunk USING INDEX sqlite_autoindex_chunk_1 (file_id=? AND offset>? AND offset<?)
```
This shows two efficient index searches: one for the subquery and one for the main query's range scan.
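The plan can also be checked programmatically. A sketch, with the caveat that `EXPLAIN QUERY PLAN` output wording (index names, `SEARCH` vs `SEARCH TABLE`) varies between SQLite versions, so only the presence of index searches should be relied on:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE chunk (
    file_id INTEGER NOT NULL, offset INTEGER NOT NULL,
    chunk BLOB NOT NULL, PRIMARY KEY (file_id, offset))""")

# The last column of each EXPLAIN QUERY PLAN row is the plan detail text.
plan = con.execute(
    """EXPLAIN QUERY PLAN
       SELECT offset FROM chunk
       WHERE file_id = 1
         AND offset >= (SELECT MAX(offset) FROM chunk
                        WHERE file_id = 1 AND offset <= 1048583)
         AND offset < 1052673""").fetchall()
for row in plan:
    print(row[-1])
```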
Step 3: Handling Edge Cases
- Empty Files: If no chunks exist for the given `file_id`, the subquery yields NULL and the query returns nothing.
- Holes Between Chunks: The solution assumes no gaps between chunks. If gaps are possible (or the lower bound may precede the first chunk), wrap the subquery in `COALESCE(..., 0)` to default to offset 0.
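Both edge cases can be exercised directly. A sketch; the `COALESCE` fallback to 0 matters only when the subquery can return NULL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE chunk (
    file_id INTEGER NOT NULL, offset INTEGER NOT NULL,
    chunk BLOB NOT NULL, PRIMARY KEY (file_id, offset))""")
# File 1 has a single chunk that does NOT start at offset 0.
con.execute("INSERT INTO chunk VALUES (1, 4096, ?)", (b"x" * 4096,))

SQL = """SELECT offset FROM chunk
         WHERE file_id = :fid
           AND offset >= COALESCE((SELECT MAX(offset) FROM chunk
                                   WHERE file_id = :fid AND offset <= :lo), 0)
           AND offset < :hi"""

# Empty file: no chunks for file_id 2, so nothing is returned.
print(con.execute(SQL, {"fid": 2, "lo": 0, "hi": 100}).fetchall())     # []

# Lower bound before the first chunk: MAX(...) is NULL, COALESCE defaults
# the start to 0, and the chunk at offset 4096 is still found.
print(con.execute(SQL, {"fid": 1, "lo": 100, "hi": 8192}).fetchall())  # [(4096,)]
```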
Performance Comparison and Best Practices
Original Overlap Condition:
- Pros: Simple to write.
- Cons: May scan many rows if `offset <= :upper` is not selective. Worst-case complexity: O(N).

Subquery Approach:
- Pros: Uses the index for both bounds. Worst-case complexity: O(log N + K), where K is the number of chunks in the range.
- Cons: Requires an extra scalar subquery, which adds slight overhead.
Benchmark Scenario:
- Table with 1 million chunks for `file_id = 1`.
- Requested range: `offset >= 500,000` and `offset < 500,100`.

Results:
- Overlap Condition: Scans all chunks from `offset = 0` to `offset = 500,100` (500,100 rows), then filters.
- Subquery Approach: Performs two index searches (`MAX(offset)` and the bounded range) and scans 100 rows.
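Rather than trusting estimates, the two forms can be cross-checked for equivalence on generated data. A sketch with sizes scaled down from the scenario above to keep it fast; the lower bound is deliberately placed off a chunk boundary, since `offset + length(chunk) >= :lower` also admits a chunk that ends exactly at `:lower`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE chunk (
    file_id INTEGER NOT NULL, offset INTEGER NOT NULL,
    chunk BLOB NOT NULL, PRIMARY KEY (file_id, offset))""")
# 10,000 contiguous 16-byte chunks for file 1 (offsets 0, 16, ..., 159984).
con.executemany("INSERT INTO chunk VALUES (1, ?, ?)",
                ((i * 16, b"x" * 16) for i in range(10_000)))

args = {"fid": 1, "lo": 80_007, "hi": 80_160}

overlap = con.execute(
    """SELECT offset FROM chunk
       WHERE file_id = :fid AND offset < :hi
         AND offset + length(chunk) >= :lo""", args).fetchall()

subquery = con.execute(
    """SELECT offset FROM chunk
       WHERE file_id = :fid
         AND offset >= (SELECT MAX(offset) FROM chunk
                        WHERE file_id = :fid AND offset <= :lo)
         AND offset < :hi""", args).fetchall()

assert overlap == subquery
print(len(overlap))  # 10
```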
Schema Design Considerations
Precompute Chunk End Offsets:
Add a generated column `end_offset` (SQLite 3.31+):

```sql
CREATE TABLE chunk (
    file_id    INTEGER NOT NULL,
    offset     INTEGER NOT NULL,
    chunk      BLOB NOT NULL,
    end_offset INTEGER GENERATED ALWAYS AS (offset + length(chunk)) VIRTUAL,
    PRIMARY KEY (file_id, offset)
);
```

Create an index on `(file_id, end_offset)`:

```sql
CREATE INDEX idx_chunk_end ON chunk(file_id, end_offset);
```

Query:

```sql
SELECT offset FROM chunk
WHERE file_id = :file_id AND offset <= :upper AND end_offset >= :lower;
```

This gives the optimizer an indexed column for each bound; it can pick whichever of the `(file_id, offset)` and `(file_id, end_offset)` indexes it estimates to be more selective.

Chunk Size Uniformity:
Use fixed-size chunks (e.g., 1 MiB) to simplify range calculations. The starting chunk index becomes `FLOOR(:lower / chunk_size)`, and the query reduces to a simple range scan on the primary key.

Covering Indexes:
If `length(chunk)` is frequently used in overlap conditions, an index on the expression keeps the filter out of the table b-tree (expressions cannot appear in the primary key itself):

```sql
CREATE INDEX idx_chunk_covering ON chunk(file_id, offset, length(chunk));
```
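With fixed-size chunks, the boundary math collapses to integer arithmetic. A sketch; `CHUNK_SIZE` and the helper name `chunk_range` are illustrative:

```python
import sqlite3

CHUNK_SIZE = 4096

def chunk_range(lower, upper):
    """First and (exclusive) last chunk offsets covering bytes [lower, upper)."""
    first = (lower // CHUNK_SIZE) * CHUNK_SIZE            # round down
    last = ((upper + CHUNK_SIZE - 1) // CHUNK_SIZE) * CHUNK_SIZE  # round up
    return first, last

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE chunk (
    file_id INTEGER NOT NULL, offset INTEGER NOT NULL,
    chunk BLOB NOT NULL, PRIMARY KEY (file_id, offset))""")
for i in range(4):
    con.execute("INSERT INTO chunk VALUES (1, ?, ?)",
                (i * CHUNK_SIZE, b"x" * CHUNK_SIZE))

first, last = chunk_range(4100, 8200)   # (4096, 12288)
rows = con.execute(
    "SELECT offset FROM chunk WHERE file_id = 1 AND offset >= ? AND offset < ?",
    (first, last)).fetchall()
print([r[0] for r in rows])  # [4096, 8192]
```

No subquery is needed here: the chunk containing any byte position is computable directly from the position, which is the main payoff of uniform chunk sizes.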
Advanced Techniques for Large-Scale Systems
Partitioning by File ID:
For systems with millions of files, partition the `chunk` table by `file_id` across separate tables; SQLite has no declarative partitioning, so this is done at the application level. This reduces index depth per file and improves locality.

Batch Retrieval with CTEs:
Use a Common Table Expression (CTE) to retrieve multiple ranges in a single query:

```sql
WITH ranges(file_id, lower, upper) AS (
    VALUES (1, 1, 1000), (2, 5000, 6000)
)
SELECT c.offset, c.chunk
FROM ranges r
JOIN chunk c
  ON c.file_id = r.file_id
 AND c.offset >= (SELECT MAX(offset) FROM chunk
                  WHERE file_id = r.file_id AND offset <= r.lower)
 AND c.offset < r.upper;
```

Approximate Range Queries:
If exact overlap is unnecessary (e.g., for previewing data), use probabilistic structures such as Bloom filters or coarse range-metadata tables.
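The batched CTE form can be exercised the same way. A sketch with two files of four 100-byte chunks each, and an `ORDER BY` added so the result order is deterministic:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE chunk (
    file_id INTEGER NOT NULL, offset INTEGER NOT NULL,
    chunk BLOB NOT NULL, PRIMARY KEY (file_id, offset))""")
for fid in (1, 2):
    for i in range(4):
        con.execute("INSERT INTO chunk VALUES (?, ?, ?)",
                    (fid, i * 100, b"x" * 100))

# One query serves two byte ranges: [50, 250) of file 1 and [310, 320) of file 2.
rows = con.execute(
    """WITH ranges(file_id, lower, upper) AS (
           VALUES (1, 50, 250), (2, 310, 320)
       )
       SELECT c.file_id, c.offset
       FROM ranges r
       JOIN chunk c
         ON c.file_id = r.file_id
        AND c.offset >= (SELECT MAX(offset) FROM chunk
                         WHERE file_id = r.file_id AND offset <= r.lower)
        AND c.offset < r.upper
       ORDER BY c.file_id, c.offset""").fetchall()
print(rows)  # [(1, 0), (1, 100), (1, 200), (2, 300)]
```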
Conclusion
Efficiently retrieving blob chunks overlapping a byte range in SQLite requires careful query design to leverage primary key indexes. Avoid computed expressions in `WHERE` clauses and instead use subqueries to pinpoint the start and end of the relevant chunk range. Precomputing chunk boundaries or using generated columns further optimizes performance for large datasets.