Extremely Slow SELECT ORDER BY Primary Key in Large SQLite Table
Understanding the Performance Impact of Primary Key Scans in Rowid Tables
The core issue revolves around unexpectedly slow performance when executing a SELECT
query that orders results by a TEXT
primary key in a large SQLite table (7 million rows, ~5GB). The schema defines the primary key as key TEXT PRIMARY KEY
, which implicitly creates an internal rowid
integer for row storage. The query SELECT key, sequence, flags, version, length(body), expiration FROM "kv_.reviews" ORDER BY key
triggers a full table scan using the primary key index (sqlite_autoindex_kv_.reviews_1
), but this results in 7 million rowid
lookups scattered across the database file. These non-sequential disk reads (pread
syscalls) cause significant thrashing, even with a 200MB cache, leading to a 10-minute execution time on modern hardware.
The problem arises because SQLite’s default storage model for tables with explicit primary keys (non-integer or non-WITHOUT ROWID
) uses rowid
-based b-tree structures. While the primary key index orders entries by key
, the actual table data is stored in rowid
order, which is unrelated to the key
ordering. To retrieve non-indexed columns (e.g., sequence
, flags
), SQLite must perform a rowid
lookup for each row after fetching the ordered keys from the index. This forces the database engine to jump between random locations in the file, overwhelming even high-speed SSDs due to the sheer volume of I/O operations. The presence of large BLOB
columns exacerbates the issue, as their variable-length storage requires additional page traversals to locate subsequent fields.
Root Causes of Inefficient Ordered Scans in Rowid-Based Schemas
1. Misalignment Between Primary Key Index and Physical Row Order
In standard SQLite tables (without WITHOUT ROWID
), the primary key is implemented as a unique index over an implicit rowid
. The table’s data pages are organized by rowid
, not the primary key. When querying with ORDER BY key
, the optimizer uses the primary key index to retrieve keys in order but must then fetch each row’s data via rowid
, which corresponds to arbitrary physical locations. This mismatch between logical key order and physical row order forces random I/O.
2. BLOB Column Placement and Overflow Page Traversal
The schema places BLOB
columns (body
, version
, extra
) before the expiration
integer. SQLite stores rows as serialized records, with each field’s length encoded in a header. For large BLOBs
, if the total row size exceeds the database’s page size minus reserved space, the BLOB
data spills into overflow pages. To access columns after a BLOB
, SQLite must parse the BLOB
’s length from the header and skip over it. If the BLOB
spans multiple overflow pages, the engine must traverse a linked list of pages to locate subsequent fields, adding latency.
3. Suboptimal Query Planner Decisions with Indexed ORDER BY
The query planner prioritizes using the primary key index to satisfy the ORDER BY
clause, unaware that the resulting rowid
lookups will trigger random I/O. It assumes indexed ordering is always faster, even when the cost of row retrieval dominates. The ANALYZE
command, which gathers statistics for the query planner, may not resolve this because the primary key’s uniqueness and high cardinality make it appear optimal for sorting, masking the physical storage overhead.
Optimization Strategies for Ordered Scans in Large Tables
1. Convert to a WITHOUT ROWID
Table for Clustered Primary Key Storage
Solution: Redefine the table to use WITHOUT ROWID
, clustering the data pages by the primary key.
Steps:
CREATE TABLE "kv_.reviews" (
key TEXT PRIMARY KEY,
sequence INTEGER,
flags INTEGER DEFAULT 0,
version BLOB,
body BLOB,
extra BLOB,
expiration INTEGER
) WITHOUT ROWID;
Mechanism: A WITHOUT ROWID
table stores rows directly in the primary key b-tree, eliminating the rowid
indirection. Data pages are ordered by key
, so an ORDER BY key
scan reads rows sequentially from disk.
Trade-offs:
- Slightly larger storage footprint due to b-tree overhead.
- Slower inserts/updates if keys are non-sequential (fragmentation).
- Not all SQLite features (e.g.,
AUTOINCREMENT
) are supported.
Performance Impact: Reduces the query runtime from 10 minutes to near-linear scan speeds (e.g., seconds), as seen in the user’s in-memory sorting workaround.
2. Force a Full Table Scan with In-Memory Sort Using ORDER BY +key
Solution: Modify the query to use ORDER BY +key
, suppressing index usage.
Implementation:
SELECT key, sequence, flags, version, length(body), expiration
FROM "kv_.reviews"
ORDER BY +key;
Mechanism: The unary +
operator converts key
into an expression, preventing the query planner from using the primary key index for ordering. This forces a full table scan, which reads rows in physical (rowid
) order, followed by an in-memory sort.
When to Use: When schema changes are impractical, and the working set fits in memory.
Caveats:
- In-memory sorting requires
O(n)
memory (proportional to the result set size). - Full scans may still trigger I/O for overflow pages.
3. Reorder Columns to Minimize Overflow Page Access
Solution: Move frequently accessed non-BLOB
columns (e.g., expiration
) before BLOB
columns.
Revised Schema:
CREATE TABLE "kv_.reviews" (
key TEXT PRIMARY KEY,
sequence INTEGER,
flags INTEGER DEFAULT 0,
expiration INTEGER, -- Moved before BLOBs
version BLOB,
body BLOB,
extra BLOB
);
Mechanism: Columns declared earlier in the schema are stored closer to the row header. By placing expiration
before BLOB
fields, SQLite can retrieve its value without parsing or skipping over BLOB
data. This is especially impactful if BLOB
values are large and span overflow pages.
Verification: Use sqlite3_analyzer
or .dbinfo
to check overflow page counts before/after reorganization.
4. Isolate Large BLOBs in Separate 1:1 Tables
Solution: Split BLOB
columns into a secondary table linked by key
.
Schema:
CREATE TABLE "kv_.reviews" (
key TEXT PRIMARY KEY,
sequence INTEGER,
flags INTEGER DEFAULT 0,
expiration INTEGER
) WITHOUT ROWID;
CREATE TABLE "kv_.review_blobs" (
key TEXT PRIMARY KEY,
version BLOB,
body BLOB,
extra BLOB,
FOREIGN KEY(key) REFERENCES "kv_.reviews"(key)
) WITHOUT ROWID;
Mechanism: Isolating BLOB
s ensures they do not inflate the main table’s row size or cause overflow pages. Queries targeting non-BLOB
columns avoid unnecessary I/O.
Query Adjustment:
SELECT r.key, r.sequence, r.flags, b.version, length(b.body), r.expiration
FROM "kv_.reviews" r
LEFT JOIN "kv_.review_blobs" b ON r.key = b.key
ORDER BY r.key;
5. Increase Database Page Size and Cache Configuration
Solution: Adjust the page size to 16KB or 32KB and raise the cache size.
Steps:
sqlite3 database.db "PRAGMA page_size=16384; VACUUM;"
sqlite3 database.db "PRAGMA cache_size=-200000;" -- 200MB cache
Mechanism: Larger pages reduce the likelihood of overflow pages by accommodating more rows per page. A larger cache improves hit rates for frequently accessed pages.
Considerations: Page size changes require a VACUUM
. Balance larger pages against increased I/O for small reads.
6. Periodic VACUUM
to Defragment Overflow Chains
Solution: Run VACUUM
regularly to rebuild the database and collapse overflow chains.
Effectiveness: While VACUUM
reorders pages to reduce fragmentation, it does not cluster rows by primary key in rowid
tables. However, it can improve scan performance by consolidating overflow pages.
Automation: Schedule VACUUM
during off-peak hours or after bulk data changes.
Final Recommendations:
For immediate relief, apply ORDER BY +key
to bypass index-induced random I/O. Long-term, convert the table to WITHOUT ROWID
and reorder columns to prioritize frequently accessed fields. Isolating BLOB
s and tuning page/cache sizes provide additional gains. Always validate changes with EXPLAIN QUERY PLAN
and real-world benchmarks.