Optimizing SQLite Bulk CSV Imports with Concurrent Read Access
Issue Overview: Slow Repeated Bulk CSV Imports in WAL Mode
The core challenge involves efficiently importing a 10 GB CSV file (containing 20 million rows) into an SQLite database while ensuring read operations can proceed without blocking. The initial import completes in 4 minutes, but subsequent imports degrade to 30 minutes despite using Write-Ahead Logging (WAL) mode. The schema includes a table foo with nine columns and an index on the ip column. The process involves deleting existing data, recreating the table structure, importing CSV data, and rebuilding the index. Key symptoms include performance degradation on repeated imports, suboptimal transaction management, and contention between write-heavy operations and concurrent reads.
WAL mode reduces write-read contention by allowing readers to access the database while writers append changes to a separate log. However, large bulk imports introduce complexities: index maintenance, autovacuum inefficiencies, and WAL file growth. The DELETE FROM foo command marks rows as free space without reclaiming storage, leading to database bloat. Subsequent imports must navigate fragmented pages, increasing I/O overhead. The CREATE INDEX command after import forces a full table scan and index build, which becomes slower as the database grows. The .import command in the SQLite shell executes one INSERT per row under the hood, which can generate substantial transaction overhead when rows are not batched.
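To confirm that deleted rows are leaving the file bloated rather than shrinking it, the free-page count can be inspected directly. The following is a minimal diagnostic sketch in Python; the database path is an assumption matching the scenario above.

    import sqlite3

    conn = sqlite3.connect('database.db')   # assumed path

    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    freelist = conn.execute("PRAGMA freelist_count").fetchone()[0]

    # A large freelist after DELETE FROM foo indicates space that is marked
    # free for reuse but is never returned to the filesystem until VACUUM runs.
    print(f"file size : {page_size * page_count / 1e9:.2f} GB")
    print(f"free pages: {freelist} "
          f"({page_size * freelist / 1e9:.2f} GB reusable but unreclaimed)")

    conn.close()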
Possible Causes: Indexing, Fragmentation, and Transaction Overhead
Index Maintenance During Bulk Operations: Creating or dropping indexes during repeated imports introduces non-linear scalability. The CREATE INDEX command after .import rebuilds the index from scratch, which requires sorting all 20 million rows. This operation scales with O(n log n) complexity, becoming disproportionately slow as data volume increases.

Database Fragmentation and Free Page Management: The DELETE FROM foo command does not reduce the database size; it marks pages as free for reuse. Over time, free space becomes fragmented, forcing the database to seek non-contiguous pages during bulk inserts. This increases I/O latency, especially on rotational disks or cloud storage with higher seek times.

Implicit Transaction Boundaries in .import: The SQLite shell's .import command splits CSV rows into individual INSERT statements. Without explicit batching, each INSERT operates in an autocommit transaction, triggering frequent WAL flushes and checkpointing. This generates excessive write amplification, particularly when combined with WAL mode's append-only design (see the sketch after this list).

Insufficient Cache Configuration: The default cache_size (roughly 2 MB) is inadequate for 20 million rows. Small caches force SQLite to evict pages prematurely, increasing disk I/O during index builds and data imports.

WAL File Growth and Checkpoint Stalls: Large imports append data to the WAL file, which must be checkpointed into the main database. If readers hold open transactions, checkpointing stalls, causing the WAL file to grow unbounded. This increases commit latency and read times due to longer WAL replay.

Schema Definition and Type Affinity: The CREATE TABLE statement uses TEXT for all fields except num1 and num2. Implicit type conversions during CSV parsing can slow down inserts, especially if numeric fields are stored as text.
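The transaction-boundary cost is easy to observe in isolation. Below is a rough timing sketch (the throwaway table, column count, and row count are illustrative, not the schema from the scenario): it inserts the same rows once in autocommit mode and once inside a single explicit transaction.

    import os
    import sqlite3
    import time

    def timed_insert(path: str, batched: bool, n: int = 10_000) -> float:
        if os.path.exists(path):
            os.remove(path)
        # isolation_level=None -> autocommit unless we issue BEGIN ourselves.
        conn = sqlite3.connect(path, isolation_level=None)
        conn.execute("CREATE TABLE t(a TEXT, b TEXT)")
        start = time.perf_counter()
        if batched:
            conn.execute("BEGIN")
        for i in range(n):
            conn.execute("INSERT INTO t VALUES (?, ?)", (str(i), "x"))
        if batched:
            conn.execute("COMMIT")
        elapsed = time.perf_counter() - start
        conn.close()
        return elapsed

    # Even a few thousand rows show the gap: every autocommit INSERT
    # pays for its own commit and flush.
    print("autocommit :", timed_insert("bench_auto.db", batched=False))
    print("one txn    :", timed_insert("bench_txn.db", batched=True))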
Troubleshooting Steps, Solutions & Fixes
1. Pre-Import Optimization: Schema and Index Management
Defer Index Creation: Build indexes after data import to avoid incremental index updates. Replace CREATE INDEX idx ON foo(ip) with a two-phase process:

    -- Phase 1: Import data without indexes
    PRAGMA journal_mode=WAL;
    PRAGMA cache_size = -1000000;  -- ~1 GB cache
    BEGIN;
    DROP INDEX IF EXISTS idx;
    DELETE FROM foo;
    COMMIT;
    VACUUM;                        -- Reclaim free space
    .mode csv
    .import sample.csv foo

    -- Phase 2: Build indexes after data is loaded
    BEGIN;
    CREATE INDEX idx ON foo(ip);
    COMMIT;
The VACUUM command rebuilds the database file, eliminating fragmentation.

Use WITHOUT ROWID for Index-Intensive Tables: If queries frequently access columns via the ip index, consider a WITHOUT ROWID table to cluster data by the index:

    CREATE TABLE IF NOT EXISTS foo(
      ip TEXT PRIMARY KEY,
      field1 TEXT,
      ...,
      field6 TEXT
    ) WITHOUT ROWID;

This eliminates the separate rowid storage and clusters rows by ip, reducing index lookup overhead. Note that ip becomes the primary key, so its values must be unique.
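Because inserts into a clustered table are fastest when rows arrive in key order, one workable pattern is to load the CSV into the plain rowid table first and then copy it into the WITHOUT ROWID table sorted by ip. This is only a sketch: the table name foo_clustered, the column names, and the assumption that ip is the first (and unique) column of foo are all illustrative.

    import sqlite3

    conn = sqlite3.connect('database.db')
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS foo_clustered(
            ip TEXT PRIMARY KEY,
            field1 TEXT, field2 TEXT, field3 TEXT,
            field4 TEXT, field5 TEXT, field6 TEXT,
            num1 INTEGER, num2 INTEGER
        ) WITHOUT ROWID;

        -- Copying in sorted key order keeps page writes mostly sequential;
        -- assumes foo's columns line up with the list above.
        INSERT INTO foo_clustered
            SELECT * FROM foo ORDER BY ip;
    """)
    conn.commit()
    conn.close()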
2. Transaction and I/O Optimization
Batch Inserts with Explicit Transactions: Replace .import with a custom script that batches inserts into multi-row transactions. For example, using Python:

    import csv
    import sqlite3

    # isolation_level=None disables the module's implicit transaction
    # handling, so the explicit BEGIN/COMMIT below control the batching.
    conn = sqlite3.connect('database.db', isolation_level=None)
    cursor = conn.cursor()
    cursor.execute("PRAGMA cache_size = -1000000")   # ~1 GB page cache

    cursor.execute("BEGIN")
    with open('sample.csv', 'r', newline='') as f:
        reader = csv.reader(f)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == 100000:                 # Commit every 100k rows
                cursor.executemany("INSERT INTO foo VALUES (?,?,?,?,?,?,?,?,?)", batch)
                cursor.execute("COMMIT")
                cursor.execute("BEGIN")
                batch = []
        if batch:
            cursor.executemany("INSERT INTO foo VALUES (?,?,?,?,?,?,?,?,?)", batch)
    cursor.execute("COMMIT")
    conn.close()
Batching reduces transaction commits from 20 million to 200, minimizing WAL flushes.
Adjust WAL Checkpoint Threshold: Use PRAGMA wal_autocheckpoint to control WAL file size. Setting wal_autocheckpoint = 1000 triggers an automatic checkpoint once the WAL reaches roughly 1000 pages (about 4 MB at the default 4 KB page size).
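A rough sketch of wiring this up from Python follows; the checkpoint pragma returns a (busy, wal_pages, checkpointed_pages) row, which is useful for spotting checkpoints blocked by long-running readers.

    import sqlite3

    conn = sqlite3.connect('database.db')
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA wal_autocheckpoint = 1000")  # checkpoint every ~1000 WAL pages

    # ... bulk import happens here ...

    # Force a checkpoint and inspect the result: busy=1 means a reader
    # (or another writer) prevented the checkpoint from completing.
    busy, wal_pages, moved = conn.execute("PRAGMA wal_checkpoint(FULL)").fetchone()
    print(f"busy={busy}, pages in WAL={wal_pages}, pages checkpointed={moved}")
    conn.close()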
3. Database Configuration and Hardware Tuning
Increase Cache and Page Size:

    PRAGMA cache_size = -1000000;  -- ~1 GB cache
    PRAGMA page_size = 65536;      -- 64 KB pages

Larger pages reduce the number of I/O operations for index scans. Note that page_size only takes effect before the first table is created, or when an existing file is subsequently rebuilt with VACUUM, and it cannot be changed while the database is in WAL mode.
Enable Memory-Mapped I/O:

    PRAGMA mmap_size = 30000000000; -- 30 GB

Memory mapping lets SQLite read pages directly from the operating system's page cache, avoiding the extra kernel-to-user-space copy of conventional reads.
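One way to apply these settings from the import script, respecting the page_size caveat above, is sketched here; the values simply mirror the pragmas listed in this section and should be tuned to the available RAM.

    import sqlite3

    conn = sqlite3.connect('database.db', isolation_level=None)

    # page_size cannot be changed while the database is in WAL mode,
    # so temporarily switch back to a rollback journal for the rebuild.
    conn.execute("PRAGMA journal_mode = DELETE")
    conn.execute("PRAGMA page_size = 65536")
    conn.execute("VACUUM")                          # rebuild with the new page size
    conn.execute("PRAGMA journal_mode = WAL")       # back to WAL for concurrent readers

    conn.execute("PRAGMA cache_size = -1000000")    # ~1 GB page cache
    conn.execute("PRAGMA mmap_size = 30000000000")  # ~30 GB memory map

    conn.close()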
Separate Databases for Read and Write: Use ATTACH DATABASE to import into a temporary database, then merge into the main database:

    ATTACH DATABASE 'temp.db' AS temp;
    INSERT INTO main.foo SELECT * FROM temp.foo;

This isolates the write-heavy CSV parsing and staging from the main database, so readers are only affected during the final INSERT ... SELECT merge.
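An end-to-end sketch of this staging pattern in Python follows; temp.db, the column names, and the assumption of nine columns are illustrative choices rather than values from the original setup.

    import csv
    import sqlite3

    # Stage the CSV into a throwaway database file.
    stage = sqlite3.connect('temp.db', isolation_level=None)
    stage.execute("CREATE TABLE IF NOT EXISTS foo("
                  "ip TEXT, field1 TEXT, field2 TEXT, field3 TEXT, "
                  "field4 TEXT, field5 TEXT, field6 TEXT, num1 INTEGER, num2 INTEGER)")
    stage.execute("BEGIN")
    with open('sample.csv', newline='') as f:
        stage.executemany("INSERT INTO foo VALUES (?,?,?,?,?,?,?,?,?)", csv.reader(f))
    stage.execute("COMMIT")
    stage.close()

    # Merge into the main database in a single write transaction.
    main = sqlite3.connect('database.db', isolation_level=None)
    main.execute("ATTACH DATABASE 'temp.db' AS temp")
    main.execute("BEGIN")
    main.execute("DELETE FROM foo")
    main.execute("INSERT INTO main.foo SELECT * FROM temp.foo")
    main.execute("COMMIT")
    main.execute("DETACH DATABASE temp")
    main.close()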
4. Post-Import Maintenance
Analyze Query Plans: Run ANALYZE after data changes to update statistics for the query planner:

    ANALYZE;

This ensures optimal index usage for subsequent read queries.
Monitor WAL File Size: Use PRAGMA wal_checkpoint(TRUNCATE) after imports to force WAL truncation; this checkpoints all frames and then resets the WAL file to zero length, provided no reader is still holding an older snapshot.
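Both maintenance steps can be chained at the end of the import script, roughly as follows (a sketch under the same assumed database path as the earlier examples):

    import sqlite3

    conn = sqlite3.connect('database.db')

    conn.execute("ANALYZE")   # refresh planner statistics after the bulk load

    # Checkpoint everything and shrink the -wal file back to zero length.
    busy, wal_pages, moved = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
    if busy:
        print("checkpoint could not complete; a reader still holds an old snapshot")

    conn.close()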
5. Alternative Approaches
Use SQLite's .mode insert for Bulk Transfers: Convert the data into a series of INSERT statements and load them in a single transaction. Note that .mode insert controls how query results are printed, so the typical pattern is to dump a staging table (for example, the temp.db from the ATTACH step above) as INSERT statements and replay the resulting SQL file:

    sqlite3 temp.db ".mode insert foo" "SELECT * FROM foo;" > inserts.sql
    sqlite3 database.db "BEGIN;" ".read inserts.sql" "COMMIT;"

This approach consolidates inserts but may require splitting large CSVs (or the generated SQL file) to avoid memory exhaustion.
Leverage External Tools: Use csvsql from the csvkit suite to infer a schema from the CSV and insert the rows directly:

    csvsql --db sqlite:///database.db --insert sample.csv
By systematically addressing index management, transaction granularity, and database configuration, bulk import times can be stabilized at near-initial speeds while maintaining read availability.