Optimizing Read-Only SQLite Tables with Duplicate Values and Index Overhead


Understanding the Core Challenge: Index Overhead in Single-Column Tables with Duplicates

Scenario Context
A developer manages a read-only SQLite table (CREATE TABLE t(value INT);) containing over 100 million rows. Besides the integer value column, the table implicitly carries a ROWID, the 64-bit key of the table's B-tree that SQLite assigns to every ordinary (non-WITHOUT ROWID) table. Queries like SELECT * FROM t WHERE value = ? are slow without an index. Creating an index (CREATE INDEX idx ON t(value);) makes them fast but roughly doubles the database size. The goal is to eliminate the redundant storage while retaining fast lookups.
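
A minimal reproduction of the tradeoff (the EXPLAIN QUERY PLAN output shown as comments is illustrative; exact wording varies across SQLite versions):

CREATE TABLE t(value INT);
-- ... load 100M+ rows ...

EXPLAIN QUERY PLAN SELECT * FROM t WHERE value = 1312220;
-- SCAN t                                  (full table scan: slow)

CREATE INDEX idx ON t(value);
EXPLAIN QUERY PLAN SELECT * FROM t WHERE value = 1312220;
-- SEARCH t USING COVERING INDEX idx (value=?)
--                      (fast, but the database file roughly doubles)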

Key Observations

  1. Storage Overhead: Each index entry stores the value plus the ROWID of the row it points to. For a table whose only contents are the implicit ROWID and the value column, the index therefore stores the entire dataset a second time, sorted by value.
  2. Read-Only Constraint: Since the table is static, schema modifications or data reorganization are feasible without transactional overhead.
  3. Duplicate Values: The value column contains duplicates, preventing the use of value as a primary key.

Technical Constraints

  • SQLite’s storage model for indexes inherently duplicates indexed columns and ROWID (or the table’s primary key).
  • WITHOUT ROWID tables require a unique primary key, which complicates handling duplicate value entries.
  • Alternative schema designs must balance query performance, storage efficiency, and data integrity.

Root Causes of Storage Duplication and Performance Limitations

1. Implicit Rowid and Index Structure
Every SQLite table is an ordinary rowid table unless declared WITHOUT ROWID: its rows live in a B-tree keyed by the 64-bit ROWID. An index on value stores pairs of (value, ROWID), sorted by value. For a table holding only ROWID and value, this duplicates the entire dataset.
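
To quantify the overhead, the dbstat virtual table reports the space used by each B-tree (dbstat requires SQLite built with SQLITE_ENABLE_DBSTAT_VTAB; the sqlite3 shell typically ships with it enabled):

SELECT name, SUM(pgsize) AS bytes
FROM dbstat
GROUP BY name;
-- Expect idx's total to be comparable to t's:
-- the data really is stored twice.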

2. Non-Unique Index Requirements
Since value contains duplicates, the index cannot be declared UNIQUE. Internally, SQLite appends the ROWID to every entry of a non-unique index so that each entry remains distinct; this is what makes the index usable, but it is also the source of the storage overhead.

3. WITHOUT ROWID Tradeoffs
A WITHOUT ROWID table uses its primary key as the key of its storage B-tree, so the table is its own index. Primary keys must be unique, however. If value had no duplicates, declaring it the primary key of a WITHOUT ROWID table would merge table and index into a single structure, as sketched below. With duplicates, this approach fails unless additional columns are added to the primary key.
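
For contrast, here is the duplicate-free case just mentioned, in which table and index collapse into one B-tree (the table name t_unique is illustrative):

CREATE TABLE t_unique(
  value INT PRIMARY KEY
) WITHOUT ROWID;
-- Lookups on value need no separate index: the primary-key
-- B-tree is the only structure on disk.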

4. Normalization vs. Denormalization
The presence of duplicates suggests the data might not be fully normalized. For example, if duplicate value entries represent repeated observations, storing counts (e.g., (value, count)) could reduce redundancy. However, this depends on whether individual records or aggregate statistics are needed.


Comprehensive Solutions for Minimizing Storage and Maximizing Query Speed

1. Schema Redesign with Composite Primary Keys

Objective: Merge the table and index into a single structure using WITHOUT ROWID and a composite primary key.

Steps:

  • Add a Uniqueness Guarantor: Introduce a synthetic column (e.g., seq) to create unique composite keys:

    CREATE TABLE t(
      value INT,
      seq INT,
      PRIMARY KEY(value, seq)
    ) WITHOUT ROWID;
    
    • seq ensures uniqueness, allowing the primary key to serve as the index.
    • Storage is reduced because the table and index are unified.
  • Population Strategy (for static data):

    INSERT INTO t(value, seq)
    SELECT value, ROWID FROM original_table;
    
    • Use ROWID from the original table as seq to preserve insertion order.
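
With the data in place, equality lookups descend the primary-key B-tree directly; the plan output shown as a comment is illustrative:

SELECT value, seq FROM t WHERE value = 1312220;

EXPLAIN QUERY PLAN SELECT * FROM t WHERE value = 1312220;
-- SEARCH t USING PRIMARY KEY (value=?)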

Tradeoffs:

  • Pros: Eliminates the separate index; queries use the primary key B-tree.
  • Cons: Adds a seq column, increasing per-row storage slightly.

2. Denormalization: Storing Counts Instead of Duplicates

Use Case: If duplicate value entries represent frequency counts (e.g., log entries), replace individual rows with aggregated counts.

Implementation:

CREATE TABLE t_aggregate(
  value INT PRIMARY KEY,
  count INT
) WITHOUT ROWID;

INSERT INTO t_aggregate(value, count)
SELECT value, COUNT(*) FROM original_table GROUP BY value;

Query Adjustment:
Replace SELECT * FROM t WHERE value = ? with SELECT value, count FROM t_aggregate WHERE value = ?.

Tradeoffs:

  • Pros: Dramatically reduces storage; queries remain fast via primary key.
  • Cons: Loses individual record details; only applicable for aggregate use cases.

3. Index-As-Table Hack for Read-Only Datasets

Advanced Technique: An index on value already contains the complete dataset, since every entry stores (value, ROWID). The idea is to keep only that index-shaped structure by:

  1. Materializing the index contents as a WITHOUT ROWID table.
  2. Dropping the original table.
  3. Renaming the replacement into place.

Note that the naive version sometimes suggested (create the index, drop the table, query or rename the index itself) cannot work: SQLite offers no way to query an index directly, DROP TABLE removes the table's indexes along with it, and ALTER TABLE ... RENAME applies only to tables. The realizable form is:

Steps:

-- 1. Materialize the index as a table: the composite primary key
--    holds exactly what each index entry would have held.
CREATE TABLE t_idx(
  value INT,
  id INT,                      -- carries the original ROWID
  PRIMARY KEY(value, id)
) WITHOUT ROWID;

INSERT INTO t_idx(value, id)
SELECT value, rowid FROM t;

-- 2. Drop the original table; t_idx is now the only copy of the data.
DROP TABLE t;

-- 3. Rename the replacement to take the old table's place.
ALTER TABLE t_idx RENAME TO t;

Caveats:

  • Structurally this is the same layout as the composite-primary-key design of Solution 1; the difference is procedural (build the replacement alongside the original, then swap).
  • Queries must name the carried-over rowid column explicitly (e.g., SELECT id, value FROM t WHERE value = ?); the WITHOUT ROWID replacement has no implicit rowid.
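
A useful side effect of the composite key: for a fixed value, entries are stored in id order, so a query like the following returns sorted row pointers straight from the B-tree, with no separate sort step (the literal value is illustrative):

SELECT id FROM t WHERE value = 1312220 ORDER BY id;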

4. Pointer Encoding in Primary Key

Idea: Encode duplicates using a delimited primary key to simulate an index.

Schema:

CREATE TABLE t(
  pk TEXT PRIMARY KEY,  -- Format: "value:seq"
  value INT GENERATED ALWAYS AS (CAST(SUBSTR(pk, 1, INSTR(pk, ':') - 1) AS INT)) VIRTUAL
) WITHOUT ROWID;

Population:

INSERT INTO t(pk)
VALUES
  ('1312220:1'),
  ('1312220:2'),
  ('1312220:3');

Querying:

SELECT * FROM t
WHERE pk BETWEEN '1312220:' AND '1312220:~';

Mechanics:

  • The BETWEEN clause exploits lexical ordering: every key with the prefix '1312220:' sorts at or after '1312220:', and because the digits of seq all sort below '~', at or before '1312220:~'. The range therefore captures exactly the entries for that value.
  • The virtual column value extracts the integer for external use (generated columns require SQLite 3.31 or later).
  • Because comparison is textual, keys for different values do not sort in numeric order; point lookups like the one above are unaffected, but range scans over value need fixed-width keys, as sketched below.
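
A zero-padded variant (a sketch; the width 10 assumes non-negative values below 10^10) keeps lexical order consistent with numeric order, enabling range scans over value as well:

INSERT INTO t(pk)
SELECT printf('%010d:%d', value, rowid) FROM original_table;

SELECT * FROM t
WHERE pk BETWEEN printf('%010d:', 1312220)
             AND printf('%010d:~', 1312220);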

Tradeoffs:

  • Pros: Avoids a separate index; queries use the primary key.
  • Cons: Text keys are bulkier than integers, lookups require string assembly, and range queries over value need fixed-width padding.

5. Leveraging Covering Indexes

Optimization: If the table includes additional columns, define a covering index, i.e. one containing every column the query touches, so SQLite can answer from the index alone and never visit the table's rows.

Example:
For a table CREATE TABLE t(value INT, data TEXT);, create:

CREATE INDEX idx_cover ON t(value, data);

  • Queries that touch only value and data (e.g., SELECT data FROM t WHERE value = ?) are answered entirely from idx_cover, with no row lookup.
  • A partial index (e.g., CREATE INDEX idx ON t(value) WHERE data IS NULL;) is a different tool: it shrinks the index by indexing only the rows matching its WHERE clause, and helps only queries that repeat that condition.
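
To confirm that an index covers a query, look for the COVERING INDEX keyword in the plan (output illustrative):

EXPLAIN QUERY PLAN SELECT data FROM t WHERE value = 1312220;
-- SEARCH t USING COVERING INDEX idx_cover (value=?)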

Applicability:

  • A covering index duplicates every covered column, so it speeds up reads without reducing total storage; it pays off only when the queried columns are a small subset of the full row.

6. External Compression and Custom Formats

Last Resort: For purely read-only data, precompute a binary file of sorted value entries and binary-search it directly. (A variant that stays inside the database file: store the sorted array as one large BLOB and read slices of it with SQLite's incremental blob I/O, sqlite3_blob_open() and sqlite3_blob_read().)

Steps:

  1. Export value column to a sorted binary file.
  2. Use a virtual table or application code to perform binary searches.

Example Code Snippet (C; with duplicates present this finds one matching entry, not necessarily the first):

#include <stdio.h>
#include <stdint.h>

/* Return the entry index of `target` in `path` -- a file containing
   a sorted array of native-endian 32-bit ints -- or -1 if absent. */
long find(const char *path, int32_t target) {
  FILE *f = fopen(path, "rb");
  if (!f) return -1;
  fseek(f, 0, SEEK_END);
  long low = 0, high = ftell(f) / (long)sizeof(int32_t) - 1;
  long result = -1;
  while (low <= high) {
    long mid = low + (high - low) / 2;   /* overflow-safe midpoint */
    int32_t current;
    fseek(f, mid * (long)sizeof(int32_t), SEEK_SET);
    if (fread(&current, sizeof current, 1, f) != 1) break;
    if (current < target)      low  = mid + 1;
    else if (current > target) high = mid - 1;
    else { result = mid; break; }        /* found */
  }
  fclose(f);
  return result;
}

Tradeoffs:

  • Pros: Minimal storage; each lookup costs only O(log n) seeks.
  • Cons: Bypasses SQLite entirely; requires custom code.

Final Recommendations

  1. For Pure Lookups: Use the composite primary key approach (WITHOUT ROWID + seq).
  2. For Aggregate Queries: Switch to a (value, count) schema.
  3. For Experimental Systems: Try the index-as-table hack with thorough testing.
  4. For External Integration: Implement binary search on an exported file if SQLite integration is optional.

By aligning the schema with access patterns and leveraging SQLite’s flexibility, developers can eliminate redundant storage while maintaining query performance.

