Optimizing SQLite Query Performance on Large Tables with Joins and Filters

Issue Overview: Slow Read Performance on a 2.5GB SQLite Database with Joins and Filters

A user reported significant performance degradation when querying a 2.5GB SQLite database containing three tables:

  • labels (15k rows, metadata for sensor labels)
  • plc_values (125M rows, time-series sensor readings linked to labels and reads)
  • reads (120k rows, timestamps for sensor readings)

The problematic query retrieves 8,000 rows by joining plc_values with labels and reads, filtering on two specific labels_id values. Execution time is ~7 seconds on an SSD-equipped machine, which the user considers unreasonable. Initial attempts to improve performance (indexes, cache size adjustments, in-memory database copies) yielded only marginal gains. The core issue revolves around inefficient schema design, missing or misconfigured indexes, and suboptimal query patterns.

Possible Causes: Schema Design Flaws, Index Misconfiguration, and Query Inefficiencies

1. Schema Design Limitations

1.1 Improper Use of Data Types

SQLite does not enforce strict data types, but schema declarations influence storage efficiency and query planning:

  • BIGINT and DATETIME are not native SQLite types. The declarations still work (they map to INTEGER and NUMERIC affinity, respectively), but the loose typing invites mixed storage classes, and implicit type conversions during joins or filters can prevent index usage.
  • plc_values.labels_id is declared as BIGINT (an integer), yet the query filters it with string literals (e.g., 'ALGM_SGP13_Storing.Byte_0'). A label name can never equal an integer id, so the filter either matches nothing or forces SQLite into affinity conversions and full-table scans; the name must first be resolved to its id, as shown below.
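
A minimal sketch of the fix, assuming the original column names labels_id and Label: resolve the label name to its integer id before filtering.

-- Wrong: a label name string can never equal an INTEGER id
SELECT * FROM plc_values WHERE labels_id = 'ALGM_SGP13_Storing.Byte_0';

-- Right: resolve the name to its integer id first
SELECT * FROM plc_values
WHERE labels_id = (SELECT id FROM labels WHERE Label = 'ALGM_SGP13_Storing.Byte_0');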

1.2 Missing Foreign Key Indexes

The plc_values table references labels.id and reads.id via labels_id and reads_id, but no indexes exist on these foreign keys. Without indexes, joins require full scans of the massive plc_values table.
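
A hedged sketch of the missing indexes, assuming the original column names labels_id and reads_id:

CREATE INDEX IF NOT EXISTS idx_plc_values_labels_id ON plc_values(labels_id);
CREATE INDEX IF NOT EXISTS idx_plc_values_reads_id ON plc_values(reads_id);

With these in place, each join probe becomes an index seek instead of a pass over 125M rows.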

1.3 Redundant Table Structure

The reads table stores timestamps separately from plc_values, forcing a join for every query. Storing timestamps directly in plc_values would eliminate one join and simplify queries.

2. Index Misconfiguration

2.1 Lack of Composite Indexes

The query filters on plc_values.labels_id and joins on reads.id and labels.id. Without a composite index on (labels_id, reads_id), SQLite must perform multiple index lookups or full scans.

2.2 Missing Unique Index on Label Names

The original query resolves label names to IDs through the labels table’s Label column, but there’s no unique index on labels.Label, so every name-to-id resolution scans the table.

3. Query Execution Plan Bottlenecks

3.1 Unbounded Time Range Scans

The query lacks a time filter (e.g., WHERE timestamp BETWEEN ...), so SQLite must visit every row matching the label filter even though only 8,000 rows are needed. LIMIT short-circuits the scan only when rows can be produced in the requested order straight from an index; with an ORDER BY that requires a separate sort step, every matching row is processed first.

3.2 Non-Covering Indexes

Existing indexes (if any) likely do not "cover" all columns required by the query, forcing additional lookups to retrieve timestamp and Value.

3.3 Suboptimal Join Order

SQLite’s query planner may choose to scan plc_values first (125M rows) before joining with smaller tables, amplifying I/O and CPU costs.
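
If EXPLAIN QUERY PLAN confirms a bad nesting order, SQLite honors CROSS JOIN as an explicit ordering hint: the left table becomes the outer loop. A sketch against the original schema (column names assumed):

SELECT lb.Label, pv.Value
FROM labels AS lb
CROSS JOIN plc_values AS pv
WHERE lb.Label IN ('ALGM_SGP13_Storing.Byte_0', 'FIC_PB102C_Dichte.MW')
  AND pv.labels_id = lb.id;

This forces SQLite to find the two label rows first, then probe plc_values by labels_id.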

Troubleshooting Steps, Solutions & Fixes: Schema Optimization, Index Tuning, and Query Refactoring

1. Schema Optimization

1.1 Simplify Data Types and Remove Redundancy

Redesign tables to use SQLite-native types and eliminate unnecessary joins:

-- Drop redundant 'reads' table; store timestamps directly
CREATE TABLE labels (
  id INTEGER PRIMARY KEY,  -- INTEGER instead of BIGINT
  name TEXT UNIQUE NOT NULL,  -- Unique index on label name
  "group" TEXT  -- quoted: GROUP is a reserved keyword in SQL
);

CREATE TABLE plc_values (
  label_id INTEGER NOT NULL REFERENCES labels(id),
  timestamp INTEGER NOT NULL,  -- Unix epoch (milliseconds)
  value REAL NOT NULL
);
  • Rationale: Storing timestamp directly in plc_values removes the need for a reads table. Using INTEGER for timestamps enables efficient range queries.
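
Migrating existing data into the new layout is a single INSERT ... SELECT. A sketch, assuming the old table was renamed to old_plc_values and reads.timestamp holds ISO-8601 text:

INSERT INTO plc_values (label_id, timestamp, value)
SELECT o.labels_id,
       CAST(strftime('%s', r.timestamp) AS INTEGER) * 1000,  -- to epoch ms
       o.Value
FROM old_plc_values AS o
JOIN reads AS r ON r.id = o.reads_id;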

1.2 Add Foreign Key Indexes

Create indexes on all foreign key columns to accelerate joins:

CREATE INDEX idx_plc_values_label_id ON plc_values(label_id);
CREATE INDEX idx_plc_values_timestamp ON plc_values(timestamp);
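
After building indexes on a table this large, refresh the planner’s statistics so it can cost the alternatives accurately:

ANALYZE;          -- collect index statistics into sqlite_stat1
PRAGMA optimize;  -- on recent SQLite versions, re-runs ANALYZE only where needed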

2. Index Tuning

2.1 Composite Index for Common Query Patterns

For queries filtering on label_id and ordering by timestamp, create a composite index:

CREATE INDEX idx_plc_values_label_id_timestamp ON plc_values(label_id, timestamp);

This index allows SQLite to:

  1. Quickly locate rows for a specific label_id.
  2. Retrieve the most recent (or oldest) timestamp values without scanning the table.
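
For example, fetching the most recent reading for one label (a hypothetical id) becomes a single descent of the index, with no sort step:

SELECT timestamp, value
FROM plc_values
WHERE label_id = 52
ORDER BY timestamp DESC
LIMIT 1;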

2.2 Covering Indexes

Ensure indexes include all columns required by the query to avoid "bookmark lookups":

CREATE INDEX idx_plc_values_covering ON plc_values(label_id, timestamp, value);

This index covers every plc_values column the query touches (label_id, timestamp, value), eliminating the need to access the base table.
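
Whether an index actually covers a query is visible in the plan; a quick check (exact wording varies by SQLite version):

EXPLAIN QUERY PLAN
SELECT label_id, timestamp, value FROM plc_values WHERE label_id = 52;
-- Expect: SEARCH plc_values USING COVERING INDEX idx_plc_values_covering (label_id=?)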

3. Query Refactoring

3.1 Filter by Time Range

Add a time filter to limit the number of rows processed:

SELECT lb.name, pv.timestamp, pv.value
FROM plc_values AS pv
JOIN labels AS lb ON lb.id = pv.label_id
WHERE lb.name IN ('ALGM_SGP13_Storing.Byte_0', 'FIC_PB102C_Dichte.MW')
  AND pv.timestamp BETWEEN 1710000000000 AND 1710060000000  -- ≈17-hour window in epoch ms
ORDER BY pv.timestamp  -- Ensure deterministic LIMIT
LIMIT 8000;
  • Rationale: Restricting the time range reduces the working dataset. ORDER BY ensures a predictable row order.

3.2 Materialize Frequent Queries into a Summary Table

SQLite has no materialized views: CREATE VIEW stores only the query text, so a plain view re-runs the full join on every access and saves nothing. For dashboards querying the same subset of labels repeatedly, materialize the result into a real table and refresh it periodically:

CREATE TABLE dashboard_data AS
SELECT lb.name, pv.timestamp, pv.value
FROM plc_values AS pv
JOIN labels AS lb ON lb.id = pv.label_id
WHERE lb.name IN ('ALGM_SGP13_Storing.Byte_0', 'FIC_PB102C_Dichte.MW');

CREATE INDEX idx_dashboard_data_timestamp ON dashboard_data(timestamp);

Query the summary table instead of the base tables:

SELECT * FROM dashboard_data
WHERE timestamp BETWEEN ... 
LIMIT 8000;
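
A refresh sketch: rebuilding inside one transaction means readers never observe a half-empty table.

BEGIN;
DELETE FROM dashboard_data;
INSERT INTO dashboard_data
SELECT lb.name, pv.timestamp, pv.value
FROM plc_values AS pv
JOIN labels AS lb ON lb.id = pv.label_id
WHERE lb.name IN ('ALGM_SGP13_Storing.Byte_0', 'FIC_PB102C_Dichte.MW');
COMMIT;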

4. Performance Diagnostics

4.1 Analyze the Query Plan

Use EXPLAIN QUERY PLAN to identify full scans or inefficient joins:

EXPLAIN QUERY PLAN
SELECT lb.name, pv.timestamp, pv.value
FROM plc_values AS pv
JOIN labels AS lb ON lb.id = pv.label_id
WHERE lb.name IN (...)
LIMIT 8000;
  • Expected Output:
    SEARCH TABLE labels USING INDEX sqlite_autoindex_labels_1 (name=?)
    SEARCH TABLE plc_values USING INDEX idx_plc_values_label_id_timestamp (label_id=?)

    If the output shows SCAN TABLE plc_values instead, the indexes are missing or not being used. (Recent SQLite versions print SEARCH plc_values, without the word TABLE.)

4.2 Profile I/O and Cache Usage

SQLite exposes page-cache hit/miss counters through the C API’s sqlite3_db_status() (SQLITE_DBSTATUS_CACHE_HIT and SQLITE_DBSTATUS_CACHE_MISS); there is no PRAGMA for them, and Python’s built-in sqlite3 module does not expose these counters. A practical proxy is to time the same query under different cache sizes:

import sqlite3, time

conn = sqlite3.connect('test.db')
conn.execute('PRAGMA cache_size = -10000')  # negative values are KiB: ~10MB cache

start = time.perf_counter()
rows = conn.execute('SELECT * FROM plc_values LIMIT 8000').fetchall()
print(f'{len(rows)} rows in {time.perf_counter() - start:.3f}s')
  • Interpretation: If a warm-cache rerun is much faster than the first run, the working set fits in memory; if enlarging cache_size barely helps, the bottleneck is elsewhere (missing indexes, sorting, or the query shape itself).

5. Configuration Tuning

5.1 Adjust Cache and Memory Settings

  • Page Size: The default page size is 4096 bytes (since SQLite 3.12). A larger size such as 8192 can reduce I/O for big sequential scans (see the rebuild sketch after this list):
    PRAGMA page_size = 8192;  -- applies when the database file is created or rebuilt
    
  • Cache Size: Allocate a generous slice of available RAM to SQLite’s page cache; note that negative values are in KiB, not pages:
    PRAGMA cache_size = -100000;  -- 100,000 KiB ≈ 100MB
    
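On an existing database the new page size takes effect only after the file is rebuilt; a minimal sketch:

PRAGMA page_size = 8192;  -- queued for the next rebuild
VACUUM;                   -- rewrites the file using the new page size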

5.2 Disable WAL for Read-Heavy Workloads

While Write-Ahead Logging (WAL) improves write concurrency, it can introduce read overhead in some cases:

PRAGMA journal_mode = DELETE;  -- Switch back to rollback journal

Test read performance under both modes to determine the optimal configuration.

6. Advanced Optimizations

6.1 Partition Data by Time

Split plc_values into monthly/quarterly tables (e.g., plc_values_2024Q1) to reduce per-query dataset size. Use ATTACH DATABASE or a union view to query across partitions.
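
A union-view sketch over two hypothetical quarterly partitions; SQLite can usually push the WHERE constraints down into each UNION ALL arm, so per-partition indexes remain usable:

CREATE VIEW plc_values_all AS
SELECT * FROM plc_values_2024Q1
UNION ALL
SELECT * FROM plc_values_2024Q2;

SELECT * FROM plc_values_all
WHERE label_id = 52 AND timestamp >= 1710000000000;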

6.2 Use In-Memory Databases for Hot Data

Cache frequently accessed time ranges in an in-memory database:

import sqlite3
disk_conn = sqlite3.connect('historical.db')
mem_conn = sqlite3.connect(':memory:')
disk_conn.backup(mem_conn)  # copy the entire database into memory

For incremental updates:

mem_conn.execute('ATTACH DATABASE "historical.db" AS disk')
# last_cached_time: the max timestamp already present in the in-memory copy
mem_conn.execute('INSERT INTO main.plc_values SELECT * FROM disk.plc_values WHERE timestamp > ?', (last_cached_time,))

6.3 Leverage SQLite’s RTREE for Time Series

For range-oriented time-series lookups, SQLite’s R*Tree extension can index timestamp intervals. Note that coordinates are stored as 32-bit floats by default, so millisecond epoch values are held at reduced precision (bounds are rounded outward, so queries return a superset rather than missing rows):

CREATE VIRTUAL TABLE plc_values_rtree USING rtree(
  id,              -- Primary key
  start_time,      -- Minimum timestamp
  end_time,        -- Maximum timestamp
  +label_id        -- Additional payload
);

Insert time intervals (a loading sketch follows the query below) and query for overlapping ranges with:

SELECT * FROM plc_values_rtree
WHERE start_time <= 1710060000000 AND end_time >= 1710000000000
  AND label_id IN (52, 163);
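
Loading the interval table is left implicit above; a hedged sketch that buckets raw readings into one-minute intervals per label (the bucket width is illustrative):

INSERT INTO plc_values_rtree (id, start_time, end_time, label_id)
SELECT NULL,                                  -- NULL id: assigned automatically, like a rowid
       (timestamp / 60000) * 60000,           -- bucket start (epoch ms)
       (timestamp / 60000) * 60000 + 59999,   -- bucket end (epoch ms)
       label_id
FROM plc_values
GROUP BY label_id, timestamp / 60000;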

7. Example Benchmark

After applying schema and index changes, a comparable query on a 2.5GB database should execute in <500ms:

-- Optimized schema
CREATE TABLE labels (
  id INTEGER PRIMARY KEY,
  name TEXT UNIQUE NOT NULL
);
CREATE TABLE plc_values (
  label_id INTEGER NOT NULL REFERENCES labels(id),
  timestamp INTEGER NOT NULL,
  value REAL NOT NULL
);
CREATE INDEX idx_plc_values_label_id_time ON plc_values(label_id, timestamp);

-- Optimized query
SELECT lb.name, pv.timestamp, pv.value
FROM plc_values pv
JOIN labels lb ON lb.id = pv.label_id
WHERE lb.name IN ('ALGM_SGP13_Storing.Byte_0', 'FIC_PB102C_Dichte.MW')
  AND pv.timestamp BETWEEN 1710000000000 AND 1710060000000
ORDER BY pv.timestamp
LIMIT 8000;

Expected Execution Plan:

  • SEARCH TABLE labels USING INDEX sqlite_autoindex_labels_1 (name=?)
  • SEARCH TABLE plc_values USING INDEX idx_plc_values_label_id_time (label_id=? AND timestamp>? AND timestamp<?)

This leverages indexes for both the IN filter and time range, reducing the dataset to exactly 8,000 rows without full scans.
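
To confirm the numbers on your own data, the sqlite3 command-line shell’s built-in .timer switch prints per-statement wall-clock time:

$ sqlite3 test.db
sqlite> .timer on
sqlite> -- paste the optimized query from above; elapsed time prints after the rows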

Conclusion

Achieving sub-second query times on a 2.5GB SQLite database is feasible with proper schema design, targeted indexing, and query optimization. Key takeaways include:

  1. Use INTEGER primary keys and avoid type mismatches in joins/filters.
  2. Create composite indexes aligned with query patterns.
  3. Eliminate unnecessary joins by denormalizing timestamps.
  4. Constrain queries with time ranges to limit scanned rows.
  5. Profile and tune SQLite configuration parameters (cache size, page size).

By systematically addressing schema flaws, index coverage, and query structure, the reported 7-second query can be reduced to <500ms, even on modest hardware.
