Extremely Slow SELECT ORDER BY Primary Key in Large SQLite Table


Understanding the Performance Impact of Primary Key Scans in Rowid Tables

The core issue revolves around unexpectedly slow performance when executing a SELECT query that orders results by a TEXT primary key in a large SQLite table (7 million rows, ~5GB). The schema defines the primary key as key TEXT PRIMARY KEY, which implicitly creates an internal rowid integer for row storage. The query SELECT key, sequence, flags, version, length(body), expiration FROM "kv_.reviews" ORDER BY key triggers a full table scan using the primary key index (sqlite_autoindex_kv_.reviews_1), but this results in 7 million rowid lookups scattered across the database file. These non-sequential disk reads (pread syscalls) cause significant thrashing, even with a 200MB cache, leading to a 10-minute execution time on modern hardware.

The problem arises because SQLite’s default storage model for tables with explicit primary keys (non-integer or non-WITHOUT ROWID) uses rowid-based b-tree structures. While the primary key index orders entries by key, the actual table data is stored in rowid order, which is unrelated to the key ordering. To retrieve non-indexed columns (e.g., sequence, flags), SQLite must perform a rowid lookup for each row after fetching the ordered keys from the index. This forces the database engine to jump between random locations in the file, overwhelming even high-speed SSDs due to the sheer volume of I/O operations. The presence of large BLOB columns exacerbates the issue, as their variable-length storage requires additional page traversals to locate subsequent fields.


Root Causes of Inefficient Ordered Scans in Rowid-Based Schemas

1. Misalignment Between Primary Key Index and Physical Row Order

In standard SQLite tables (without WITHOUT ROWID), the primary key is implemented as a unique index over an implicit rowid. The table’s data pages are organized by rowid, not the primary key. When querying with ORDER BY key, the optimizer uses the primary key index to retrieve keys in order but must then fetch each row’s data via rowid, which corresponds to arbitrary physical locations. This mismatch between logical key order and physical row order forces random I/O.

2. BLOB Column Placement and Overflow Page Traversal

The schema places BLOB columns (body, version, extra) before the expiration integer. SQLite stores rows as serialized records, with each field’s length encoded in a header. For large BLOBs, if the total row size exceeds the database’s page size minus reserved space, the BLOB data spills into overflow pages. To access columns after a BLOB, SQLite must parse the BLOB’s length from the header and skip over it. If the BLOB spans multiple overflow pages, the engine must traverse a linked list of pages to locate subsequent fields, adding latency.

3. Suboptimal Query Planner Decisions with Indexed ORDER BY

The query planner prioritizes using the primary key index to satisfy the ORDER BY clause, unaware that the resulting rowid lookups will trigger random I/O. It assumes indexed ordering is always faster, even when the cost of row retrieval dominates. The ANALYZE command, which gathers statistics for the query planner, may not resolve this because the primary key’s uniqueness and high cardinality make it appear optimal for sorting, masking the physical storage overhead.


Optimization Strategies for Ordered Scans in Large Tables

1. Convert to a WITHOUT ROWID Table for Clustered Primary Key Storage

Solution: Redefine the table to use WITHOUT ROWID, clustering the data pages by the primary key.
Steps:

CREATE TABLE "kv_.reviews" (
  key TEXT PRIMARY KEY,
  sequence INTEGER,
  flags INTEGER DEFAULT 0,
  version BLOB,
  body BLOB,
  extra BLOB,
  expiration INTEGER
) WITHOUT ROWID;

Mechanism: A WITHOUT ROWID table stores rows directly in the primary key b-tree, eliminating the rowid indirection. Data pages are ordered by key, so an ORDER BY key scan reads rows sequentially from disk.
Trade-offs:

  • Slightly larger storage footprint due to b-tree overhead.
  • Slower inserts/updates if keys are non-sequential (fragmentation).
  • Not all SQLite features (e.g., AUTOINCREMENT) are supported.

Performance Impact: Reduces the query runtime from 10 minutes to near-linear scan speeds (e.g., seconds), as seen in the user’s in-memory sorting workaround.

2. Force a Full Table Scan with In-Memory Sort Using ORDER BY +key

Solution: Modify the query to use ORDER BY +key, suppressing index usage.
Implementation:

SELECT key, sequence, flags, version, length(body), expiration 
FROM "kv_.reviews" 
ORDER BY +key;

Mechanism: The unary + operator converts key into an expression, preventing the query planner from using the primary key index for ordering. This forces a full table scan, which reads rows in physical (rowid) order, followed by an in-memory sort.
When to Use: When schema changes are impractical, and the working set fits in memory.
Caveats:

  • In-memory sorting requires O(n) memory (proportional to the result set size).
  • Full scans may still trigger I/O for overflow pages.

3. Reorder Columns to Minimize Overflow Page Access

Solution: Move frequently accessed non-BLOB columns (e.g., expiration) before BLOB columns.
Revised Schema:

CREATE TABLE "kv_.reviews" (
  key TEXT PRIMARY KEY,
  sequence INTEGER,
  flags INTEGER DEFAULT 0,
  expiration INTEGER,  -- Moved before BLOBs
  version BLOB,
  body BLOB,
  extra BLOB
);

Mechanism: Columns declared earlier in the schema are stored closer to the row header. By placing expiration before BLOB fields, SQLite can retrieve its value without parsing or skipping over BLOB data. This is especially impactful if BLOB values are large and span overflow pages.
Verification: Use sqlite3_analyzer or .dbinfo to check overflow page counts before/after reorganization.

4. Isolate Large BLOBs in Separate 1:1 Tables

Solution: Split BLOB columns into a secondary table linked by key.
Schema:

CREATE TABLE "kv_.reviews" (
  key TEXT PRIMARY KEY,
  sequence INTEGER,
  flags INTEGER DEFAULT 0,
  expiration INTEGER
) WITHOUT ROWID;

CREATE TABLE "kv_.review_blobs" (
  key TEXT PRIMARY KEY,
  version BLOB,
  body BLOB,
  extra BLOB,
  FOREIGN KEY(key) REFERENCES "kv_.reviews"(key)
) WITHOUT ROWID;

Mechanism: Isolating BLOBs ensures they do not inflate the main table’s row size or cause overflow pages. Queries targeting non-BLOB columns avoid unnecessary I/O.
Query Adjustment:

SELECT r.key, r.sequence, r.flags, b.version, length(b.body), r.expiration
FROM "kv_.reviews" r
LEFT JOIN "kv_.review_blobs" b ON r.key = b.key
ORDER BY r.key;

5. Increase Database Page Size and Cache Configuration

Solution: Adjust the page size to 16KB or 32KB and raise the cache size.
Steps:

sqlite3 database.db "PRAGMA page_size=16384; VACUUM;"
sqlite3 database.db "PRAGMA cache_size=-200000;"  -- 200MB cache

Mechanism: Larger pages reduce the likelihood of overflow pages by accommodating more rows per page. A larger cache improves hit rates for frequently accessed pages.
Considerations: Page size changes require a VACUUM. Balance larger pages against increased I/O for small reads.

6. Periodic VACUUM to Defragment Overflow Chains

Solution: Run VACUUM regularly to rebuild the database and collapse overflow chains.
Effectiveness: While VACUUM reorders pages to reduce fragmentation, it does not cluster rows by primary key in rowid tables. However, it can improve scan performance by consolidating overflow pages.
Automation: Schedule VACUUM during off-peak hours or after bulk data changes.


Final Recommendations:
For immediate relief, apply ORDER BY +key to bypass index-induced random I/O. Long-term, convert the table to WITHOUT ROWID and reorder columns to prioritize frequently accessed fields. Isolating BLOBs and tuning page/cache sizes provide additional gains. Always validate changes with EXPLAIN QUERY PLAN and real-world benchmarks.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *