Optimizing Read-Only SQLite Tables with Duplicate Values and Index Overhead
Understanding the Core Challenge: Index Overhead in Single-Column Tables with Duplicates
Scenario Context
A developer manages a read-only SQLite table (`CREATE TABLE t(value INT);`) containing over 100 million rows. The table implicitly includes the `ROWID` column (SQLite's default primary key) and an integer `value` column. Queries like `SELECT * FROM t WHERE value = ?` are slow without an index. Creating an index (`CREATE INDEX idx ON t(value);`) drastically improves query performance but roughly doubles the database size. The goal is to eliminate redundant storage while retaining fast query capabilities.
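The effect is easy to reproduce at small scale with Python's built-in `sqlite3` module. This sketch uses an in-memory database and the table/index names from the example above; the exact `EXPLAIN QUERY PLAN` wording varies between SQLite versions:

```python
import sqlite3

# In-memory database with the example schema
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(value INT)")
con.executemany("INSERT INTO t(value) VALUES (?)",
                [(i % 1000,) for i in range(10000)])

# Without an index, the query must scan the whole table
plan = con.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE value = ?",
                   (42,)).fetchone()[-1]
print(plan)  # e.g. "SCAN t"

# With the index, SQLite searches the index B-tree instead
con.execute("CREATE INDEX idx ON t(value)")
plan = con.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE value = ?",
                   (42,)).fetchone()[-1]
print(plan)  # e.g. "SEARCH t USING COVERING INDEX idx (value=?)"
```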
Key Observations
- Storage Overhead: The index duplicates the `value` and `ROWID` columns. For a table with only those two columns (`ROWID` and `value`), the index effectively stores the same data again, sorted by `value`.
- Read-Only Constraint: Since the table is static, schema modifications or data reorganization are feasible without transactional overhead.
- Duplicate Values: The `value` column contains duplicates, preventing the use of `value` alone as a primary key.
Technical Constraints
- SQLite's storage model for indexes inherently duplicates the indexed columns plus `ROWID` (or the table's primary key).
- `WITHOUT ROWID` tables require a unique primary key, which complicates handling duplicate `value` entries.
- Alternative schema designs must balance query performance, storage efficiency, and data integrity.
Root Causes of Storage Duplication and Performance Limitations
1. Implicit ROWID and Index Structure
Every SQLite table defaults to using `ROWID` as its primary key unless defined otherwise. An index on `value` stores `(value, ROWID)` pairs, sorted by `value`. For a table containing only `ROWID` and `value`, this duplicates the entire dataset.
2. Non-Unique Index Requirements
Since `value` contains duplicates, the index cannot enforce uniqueness. SQLite keeps every B-tree key distinct by appending the `ROWID` to each entry of a non-unique index. This ensures each index entry is distinct but adds storage overhead.
3. WITHOUT ROWID Tradeoffs
A `WITHOUT ROWID` table uses the primary key B-tree as its storage structure. However, primary keys must be unique. If `value` had no duplicates, defining `value INTEGER PRIMARY KEY` would merge the table and index into one structure. With duplicates, this approach fails unless additional columns are added to the primary key.
4. Normalization vs. Denormalization
The presence of duplicates suggests the data might not be fully normalized. For example, if duplicate `value` entries represent repeated observations, storing `(value, count)` pairs could reduce redundancy. However, this depends on whether individual records or only aggregate statistics are needed.
Comprehensive Solutions for Minimizing Storage and Maximizing Query Speed
1. Schema Redesign with Composite Primary Keys
Objective: Merge the table and index into a single structure using `WITHOUT ROWID` and a composite primary key.
Steps:
- Add a Uniqueness Guarantor: Introduce a synthetic column (e.g., `seq`) to create unique composite keys:
CREATE TABLE t(
  value INT,
  seq INT,
  PRIMARY KEY(value, seq)
) WITHOUT ROWID;
- `seq` ensures uniqueness, allowing the primary key to serve as the index.
- Storage is reduced because the table and index are unified.
- Population Strategy (for static data):
INSERT INTO t(value, seq)
SELECT value, ROWID FROM original_table;
- Use `ROWID` from the original table as `seq` to preserve insertion order.
Tradeoffs:
- Pros: Eliminates the separate index; queries use the primary key B-tree.
- Cons: Adds a `seq` column, increasing per-row storage slightly.
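The steps above can be run end to end in Python. This sketch assumes a source table named `original_table` as in the population example; the query-plan wording varies by SQLite version:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Original schema: implicit ROWID plus a duplicate-laden value column
con.execute("CREATE TABLE original_table(value INT)")
con.executemany("INSERT INTO original_table(value) VALUES (?)",
                [(7,), (7,), (7,), (9,)])

# Unified structure: the primary-key B-tree is the only storage
con.execute("""CREATE TABLE t(
    value INT,
    seq INT,
    PRIMARY KEY(value, seq)
) WITHOUT ROWID""")
con.execute("INSERT INTO t(value, seq) "
            "SELECT value, ROWID FROM original_table")

# Lookups on value use the leftmost column of the primary key
rows = con.execute("SELECT seq FROM t WHERE value = 7").fetchall()
print(len(rows))  # → 3
plan = con.execute("EXPLAIN QUERY PLAN "
                   "SELECT * FROM t WHERE value = 7").fetchone()[-1]
print(plan)  # e.g. "SEARCH t USING PRIMARY KEY (value=?)"
```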
2. Denormalization: Storing Counts Instead of Duplicates
Use Case: If duplicate `value` entries represent frequency counts (e.g., log entries), replace individual rows with aggregated counts.
Implementation:
CREATE TABLE t_aggregate(
value INT PRIMARY KEY,
count INT
) WITHOUT ROWID;
INSERT INTO t_aggregate(value, count)
SELECT value, COUNT(*) FROM original_table GROUP BY value;
Query Adjustment:
Replace `SELECT * FROM t WHERE value = ?` with `SELECT value, count FROM t_aggregate WHERE value = ?`.
Tradeoffs:
- Pros: Dramatically reduces storage; queries remain fast via primary key.
- Cons: Loses individual record details; only applicable for aggregate use cases.
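A runnable sketch of the aggregation, again assuming a source table named `original_table` with illustrative values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE original_table(value INT)")
con.executemany("INSERT INTO original_table(value) VALUES (?)",
                [(5,), (5,), (5,), (8,)])

# One row per distinct value; the count replaces the duplicate rows
con.execute("""CREATE TABLE t_aggregate(
    value INT PRIMARY KEY,
    count INT
) WITHOUT ROWID""")
con.execute("INSERT INTO t_aggregate(value, count) "
            "SELECT value, COUNT(*) FROM original_table GROUP BY value")

print(con.execute("SELECT count FROM t_aggregate WHERE value = 5")
         .fetchone()[0])  # → 3
print(con.execute("SELECT COUNT(*) FROM t_aggregate").fetchone()[0])  # → 2
```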
3. Index-As-Table Hack for Read-Only Datasets
Advanced Technique: Exploit SQLite’s internal index structure by:
- Creating the index.
- Dropping the original table.
- Querying the index directly.
Steps (illustrative pseudo-SQL; steps 3 and 4 are not valid SQLite, see caveats):
-- 1. Create the original table and index
CREATE TABLE t(value INT);
CREATE INDEX idx ON t(value);
-- 2. Export data to a temporary structure
CREATE TABLE tmp AS SELECT value, ROWID AS orig_rowid FROM t;
-- 3. Drop the original table (note: this also drops idx)
DROP TABLE t;
-- 4. There is no statement that turns an index into a table;
--    ALTER TABLE idx RENAME TO t; is rejected because
--    ALTER TABLE accepts only table names.
Caveats:
- SQLite does not support querying or renaming an index as a table, and dropping a table also drops its indexes, so the sequence above cannot actually be executed; it only illustrates the idea that the index B-tree alone contains all the data.
- Even if the index B-tree could be retained, queries would need to expose the appended `ROWID` (e.g., `SELECT rowid, value FROM t WHERE value = ?`).
- The supported way to achieve the same layout is the `WITHOUT ROWID` composite-key schema from Solution 1, which makes the primary-key B-tree the only storage structure.
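The first caveat is easy to confirm: standard SQLite rejects the rename step outright. A small Python check (in-memory database):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(value INT)")
con.execute("CREATE INDEX idx ON t(value)")
try:
    # ALTER TABLE resolves table names only, so an index name fails
    con.execute("ALTER TABLE idx RENAME TO t2")
except sqlite3.OperationalError as e:
    print(e)
```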
4. Pointer Encoding in Primary Key
Idea: Encode duplicates using a delimited primary key to simulate an index.
Schema:
CREATE TABLE t(
pk TEXT PRIMARY KEY, -- Format: "value:seq"
value INT GENERATED ALWAYS AS (CAST(SUBSTR(pk, 1, INSTR(pk, ':') - 1) AS INT)) VIRTUAL
) WITHOUT ROWID;
Population:
INSERT INTO t(pk)
VALUES
('1312220:1'),
('1312220:2'),
('1312220:3');
Querying:
SELECT * FROM t
WHERE pk BETWEEN '1312220:' AND '1312220:~';
Mechanics:
- The `BETWEEN` clause exploits lexical ordering to find all entries for a specific `value`: the digits of `seq` sort below `'~'`, so the upper bound closes the range.
- The virtual column `value` extracts the integer for external use (generated columns require SQLite 3.31+).
Tradeoffs:
- Pros: Avoids a separate index; queries use the primary key.
- Cons: Requires string manipulation and complicates updates; text keys are larger than two integers, and keys sort lexically rather than numerically (e.g., `'10:1'` precedes `'2:1'`).
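The range scan itself can be demonstrated without the generated column (omitted here for portability to pre-3.31 SQLite builds); the key values are the ones from the population example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The primary key encodes "value:seq"; the PK B-tree is the only storage
con.execute("CREATE TABLE t(pk TEXT PRIMARY KEY) WITHOUT ROWID")
con.executemany("INSERT INTO t(pk) VALUES (?)",
                [("1312220:1",), ("1312220:2",), ("1312220:3",),
                 ("1312221:1",)])

# Lexical range over the primary key isolates one value;
# ':' < digits < '~' in ASCII, so the bounds bracket exactly these keys
rows = con.execute("SELECT pk FROM t "
                   "WHERE pk BETWEEN '1312220:' AND '1312220:~'").fetchall()
print(len(rows))  # → 3
```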
5. Leveraging Covering Indexes
Optimization: If queries also read columns beyond `value`, define a covering index that includes every column the query needs, so SQLite can answer from the index B-tree alone without a row lookup.
Example:
For a table `CREATE TABLE t(value INT, data TEXT);`, create:
CREATE INDEX idx ON t(value, data);
- Because the index contains both columns, `SELECT value, data FROM t WHERE value = ?` is satisfied entirely from the index.
- If only a subset of rows matters, a partial index (e.g., `CREATE INDEX idx ON t(value) WHERE data IS NULL;`) can shrink the index by covering only those rows.
Applicability:
- A covering index stores the indexed columns a second time, so it improves query speed rather than total database size; it is only viable when that extra storage is acceptable.
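SQLite reports when an index covers a query, which makes the effect visible in Python (the plan string's exact wording varies by version, but it names the covering index):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(value INT, data TEXT)")
con.executemany("INSERT INTO t(value, data) VALUES (?, ?)",
                [(1, "a"), (2, "b"), (2, "c")])

# The index contains every column the query reads
con.execute("CREATE INDEX idx ON t(value, data)")
plan = con.execute("EXPLAIN QUERY PLAN "
                   "SELECT value, data FROM t WHERE value = 2").fetchone()[-1]
print(plan)  # e.g. "SEARCH t USING COVERING INDEX idx (value=?)"
```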
6. External Compression and Custom Formats
Last Resort: For purely read-only data, precompute a binary file of sorted `value` entries and search it directly from application code; alternatively, store the sorted array as one large BLOB inside SQLite and read it with the incremental blob I/O API (`sqlite3_blob_open`, `sqlite3_blob_read`).
Steps:
- Export the `value` column to a sorted binary file.
- Use a virtual table or application code to perform binary searches.
Example Code Snippet (C):
#include <stdio.h>

int main(void) {
    FILE *f = fopen("values.bin", "rb");
    int target = 1312220;
    /* Determine the number of int-sized entries in the file */
    fseek(f, 0, SEEK_END);
    long file_size = ftell(f);
    long low = 0, high = file_size / (long)sizeof(int) - 1;
    while (low <= high) {
        long mid = (low + high) / 2;
        int current;
        fseek(f, mid * (long)sizeof(int), SEEK_SET);
        fread(&current, sizeof(int), 1, f);
        if (current < target) low = mid + 1;
        else if (current > target) high = mid - 1;
        else { /* Found */ break; }
    }
    fclose(f);
    return 0;
}
Tradeoffs:
- Pros: Minimal storage; fastest possible lookups.
- Cons: Bypasses SQLite entirely; requires custom code.
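The same lookup can be prototyped in Python before committing to C. This sketch writes a sorted array of 32-bit little-endian integers to a temporary file (the file name and sample values are illustrative) and binary-searches it without loading it into memory:

```python
import os
import struct
import tempfile

# Write a sorted array of 32-bit little-endian integers
values = sorted([3, 7, 42, 1312220, 1312220])
path = os.path.join(tempfile.mkdtemp(), "values.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<%di" % len(values), *values))

def contains(path, target):
    """Binary-search the file for target, reading 4 bytes per probe."""
    low, high = 0, os.path.getsize(path) // 4 - 1
    with open(path, "rb") as f:
        while low <= high:
            mid = (low + high) // 2
            f.seek(mid * 4)
            (current,) = struct.unpack("<i", f.read(4))
            if current < target:
                low = mid + 1
            elif current > target:
                high = mid - 1
            else:
                return True
    return False

print(contains(path, 1312220))  # → True
print(contains(path, 99))       # → False
```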
Final Recommendations
- For Pure Lookups: Use the composite primary key approach (`WITHOUT ROWID` + `seq`).
- For Aggregate Queries: Switch to a `(value, count)` schema.
- For Experimental Systems: Treat the index-as-table idea as conceptual only; SQLite provides no supported way to query an index as a table.
- For External Integration: Implement binary search on an exported file if SQLite integration is optional.
By aligning the schema with access patterns and leveraging SQLite’s flexibility, developers can eliminate redundant storage while maintaining query performance.