Optimizing SQLite Bulk CSV Imports with Concurrent Read Access
Issue Overview: Slow Repeated Bulk CSV Imports in WAL Mode
The core challenge involves efficiently importing a 10 GB CSV file (containing 20 million rows) into an SQLite database while ensuring read operations can proceed without blocking. The initial import completes in 4 minutes, but subsequent imports degrade to 30 minutes despite using Write-Ahead Logging (WAL) mode. The schema includes a table foo with nine columns and an index on the ip column. The process involves deleting existing data, recreating the table structure, importing CSV data, and rebuilding the index. Key symptoms include performance degradation on repeated imports, suboptimal transaction management, and contention between write-heavy operations and concurrent reads.
WAL mode reduces write-read contention by allowing readers to access the database while writers append changes to a separate log. However, large bulk imports introduce complexities: index maintenance, autovacuum inefficiencies, and WAL file growth. The DELETE FROM foo command marks rows as free space without reclaiming storage, leading to database bloat. Subsequent imports must navigate fragmented pages, increasing I/O overhead. The CREATE INDEX command after import forces a full table scan and index build, which becomes slower as the database grows. The .import command in the SQLite shell executes one INSERT per row under the hood, which can generate substantial transaction overhead when rows are not batched.
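To confirm that deleted rows are leaving the file bloated rather than shrinking it, the free-page count can be inspected directly. The following is a minimal diagnostic sketch in Python; the database path is an assumption matching the scenario above.

    import sqlite3

    conn = sqlite3.connect('database.db')   # assumed path

    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    freelist = conn.execute("PRAGMA freelist_count").fetchone()[0]

    # A large freelist after DELETE FROM foo indicates space that is marked
    # free for reuse but is never returned to the filesystem until VACUUM runs.
    print(f"file size : {page_size * page_count / 1e9:.2f} GB")
    print(f"free pages: {freelist} "
          f"({page_size * freelist / 1e9:.2f} GB reusable but unreclaimed)")

    conn.close()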
Possible Causes: Indexing, Fragmentation, and Transaction Overhead
Index Maintenance During Bulk Operations: Creating or dropping indexes during repeated imports introduces non-linear scalability. The CREATE INDEX command after .import rebuilds the index from scratch, which requires sorting all 20 million rows. This operation scales with O(n log n) complexity, becoming disproportionately slow as data volume increases.

Database Fragmentation and Free Page Management: The DELETE FROM foo command does not reduce the database size; it marks pages as free for reuse. Over time, free space becomes fragmented, forcing the database to seek non-contiguous pages during bulk inserts. This increases I/O latency, especially on rotational disks or cloud storage with higher seek times.

Implicit Transaction Boundaries in .import: The SQLite shell's .import command splits CSV rows into individual INSERT statements. Without explicit batching, each INSERT operates in an autocommit transaction, triggering frequent WAL flushes and checkpointing. This generates excessive write amplification, particularly when combined with WAL mode's append-only design (see the sketch after this list).

Insufficient Cache Configuration: The default cache_size (roughly 2 MB) is inadequate for 20 million rows. Small caches force SQLite to evict pages prematurely, increasing disk I/O during index builds and data imports.

WAL File Growth and Checkpoint Stalls: Large imports append data to the WAL file, which must be checkpointed into the main database. If readers hold open transactions, checkpointing stalls, causing the WAL file to grow unbounded. This increases commit latency and read times due to longer WAL replay.

Schema Definition and Type Affinity: The CREATE TABLE statement uses TEXT for all fields except num1 and num2. Implicit type conversions during CSV parsing can slow down inserts, especially if numeric fields are stored as text.
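The transaction-boundary cost is easy to observe in isolation. Below is a rough timing sketch (the throwaway table, column count, and row count are illustrative, not the schema from the scenario): it inserts the same rows once in autocommit mode and once inside a single explicit transaction.

    import os
    import sqlite3
    import time

    def timed_insert(path: str, batched: bool, n: int = 10_000) -> float:
        if os.path.exists(path):
            os.remove(path)
        # isolation_level=None -> autocommit unless we issue BEGIN ourselves.
        conn = sqlite3.connect(path, isolation_level=None)
        conn.execute("CREATE TABLE t(a TEXT, b TEXT)")
        start = time.perf_counter()
        if batched:
            conn.execute("BEGIN")
        for i in range(n):
            conn.execute("INSERT INTO t VALUES (?, ?)", (str(i), "x"))
        if batched:
            conn.execute("COMMIT")
        elapsed = time.perf_counter() - start
        conn.close()
        return elapsed

    # Even a few thousand rows show the gap: every autocommit INSERT
    # pays for its own commit and flush.
    print("autocommit :", timed_insert("bench_auto.db", batched=False))
    print("one txn    :", timed_insert("bench_txn.db", batched=True))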
Troubleshooting Steps, Solutions & Fixes
1. Pre-Import Optimization: Schema and Index Management
Defer Index Creation: Build indexes after data import to avoid incremental index updates. Replace CREATE INDEX idx ON foo(ip) with a two-phase process:

    -- Phase 1: Import data without indexes
    PRAGMA journal_mode=WAL;
    PRAGMA cache_size = -1000000;  -- ~1 GB cache
    BEGIN;
    DROP INDEX IF EXISTS idx;
    DELETE FROM foo;
    COMMIT;
    VACUUM;                        -- Reclaim free space
    .mode csv
    .import sample.csv foo

    -- Phase 2: Build indexes after data is loaded
    BEGIN;
    CREATE INDEX idx ON foo(ip);
    COMMIT;
The VACUUM command rebuilds the database file, eliminating fragmentation.

Use WITHOUT ROWID for Index-Intensive Tables: If queries frequently access columns via the ip index, consider a WITHOUT ROWID table to cluster data by the index:

    CREATE TABLE IF NOT EXISTS foo(
      ip TEXT PRIMARY KEY,
      field1 TEXT,
      ...,
      field6 TEXT
    ) WITHOUT ROWID;

This eliminates the separate rowid storage and clusters rows by ip, reducing index lookup overhead. Note that ip becomes the primary key, so its values must be unique.
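Because inserts into a clustered table are fastest when rows arrive in key order, one workable pattern is to load the CSV into the plain rowid table first and then copy it into the WITHOUT ROWID table sorted by ip. This is only a sketch: the table name foo_clustered, the column names, and the assumption that ip is the first (and unique) column of foo are all illustrative.

    import sqlite3

    conn = sqlite3.connect('database.db')
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS foo_clustered(
            ip TEXT PRIMARY KEY,
            field1 TEXT, field2 TEXT, field3 TEXT,
            field4 TEXT, field5 TEXT, field6 TEXT,
            num1 INTEGER, num2 INTEGER
        ) WITHOUT ROWID;

        -- Copying in sorted key order keeps page writes mostly sequential;
        -- assumes foo's columns line up with the list above.
        INSERT INTO foo_clustered
            SELECT * FROM foo ORDER BY ip;
    """)
    conn.commit()
    conn.close()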
2. Transaction and I/O Optimization
Batch Inserts with Explicit Transactions: Replace .import with a custom script that batches inserts into multi-row transactions. For example, using Python:

    import csv
    import sqlite3

    # isolation_level=None disables the module's implicit transaction
    # handling, so the explicit BEGIN/COMMIT below control the batching.
    conn = sqlite3.connect('database.db', isolation_level=None)
    cursor = conn.cursor()
    cursor.execute("PRAGMA cache_size = -1000000")   # ~1 GB page cache

    cursor.execute("BEGIN")
    with open('sample.csv', 'r', newline='') as f:
        reader = csv.reader(f)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == 100000:                 # Commit every 100k rows
                cursor.executemany("INSERT INTO foo VALUES (?,?,?,?,?,?,?,?,?)", batch)
                cursor.execute("COMMIT")
                cursor.execute("BEGIN")
                batch = []
        if batch:
            cursor.executemany("INSERT INTO foo VALUES (?,?,?,?,?,?,?,?,?)", batch)
    cursor.execute("COMMIT")
    conn.close()
Batching reduces transaction commits from 20 million to 200, minimizing WAL flushes.
Adjust WAL Checkpoint Threshold: Use PRAGMA wal_autocheckpoint to control WAL file size. Setting wal_autocheckpoint = 1000 triggers an automatic checkpoint once the WAL reaches roughly 1000 pages (about 4 MB at the default 4 KB page size).
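A rough sketch of wiring this up from Python follows; the checkpoint pragma returns a (busy, wal_pages, checkpointed_pages) row, which is useful for spotting checkpoints blocked by long-running readers.

    import sqlite3

    conn = sqlite3.connect('database.db')
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA wal_autocheckpoint = 1000")  # checkpoint every ~1000 WAL pages

    # ... bulk import happens here ...

    # Force a checkpoint and inspect the result: busy=1 means a reader
    # (or another writer) prevented the checkpoint from completing.
    busy, wal_pages, moved = conn.execute("PRAGMA wal_checkpoint(FULL)").fetchone()
    print(f"busy={busy}, pages in WAL={wal_pages}, pages checkpointed={moved}")
    conn.close()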
3. Database Configuration and Hardware Tuning
Increase Cache and Page Size:

    PRAGMA cache_size = -1000000;  -- ~1 GB cache
    PRAGMA page_size = 65536;      -- 64 KB pages

Larger pages reduce the number of I/O operations for index scans. Note that page_size only takes effect before the first table is created, or when an existing file is subsequently rebuilt with VACUUM, and it cannot be changed while the database is in WAL mode.
Enable Memory-Mapped I/O:

    PRAGMA mmap_size = 30000000000; -- 30 GB

Memory mapping lets SQLite read pages directly from the operating system's page cache, avoiding the extra kernel-to-user-space copy of conventional reads.
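One way to apply these settings from the import script, respecting the page_size caveat above, is sketched here; the values simply mirror the pragmas listed in this section and should be tuned to the available RAM.

    import sqlite3

    conn = sqlite3.connect('database.db', isolation_level=None)

    # page_size cannot be changed while the database is in WAL mode,
    # so temporarily switch back to a rollback journal for the rebuild.
    conn.execute("PRAGMA journal_mode = DELETE")
    conn.execute("PRAGMA page_size = 65536")
    conn.execute("VACUUM")                          # rebuild with the new page size
    conn.execute("PRAGMA journal_mode = WAL")       # back to WAL for concurrent readers

    conn.execute("PRAGMA cache_size = -1000000")    # ~1 GB page cache
    conn.execute("PRAGMA mmap_size = 30000000000")  # ~30 GB memory map

    conn.close()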
Separate Databases for Read and Write: Use ATTACH DATABASE to import into a temporary database, then merge into the main database:

    ATTACH DATABASE 'temp.db' AS temp;
    INSERT INTO main.foo SELECT * FROM temp.foo;

This isolates the write-heavy CSV parsing and staging from the main database, so readers are only affected during the final INSERT ... SELECT merge.
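An end-to-end sketch of this staging pattern in Python follows; temp.db, the column names, and the assumption of nine columns are illustrative choices rather than values from the original setup.

    import csv
    import sqlite3

    # Stage the CSV into a throwaway database file.
    stage = sqlite3.connect('temp.db', isolation_level=None)
    stage.execute("CREATE TABLE IF NOT EXISTS foo("
                  "ip TEXT, field1 TEXT, field2 TEXT, field3 TEXT, "
                  "field4 TEXT, field5 TEXT, field6 TEXT, num1 INTEGER, num2 INTEGER)")
    stage.execute("BEGIN")
    with open('sample.csv', newline='') as f:
        stage.executemany("INSERT INTO foo VALUES (?,?,?,?,?,?,?,?,?)", csv.reader(f))
    stage.execute("COMMIT")
    stage.close()

    # Merge into the main database in a single write transaction.
    main = sqlite3.connect('database.db', isolation_level=None)
    main.execute("ATTACH DATABASE 'temp.db' AS temp")
    main.execute("BEGIN")
    main.execute("DELETE FROM foo")
    main.execute("INSERT INTO main.foo SELECT * FROM temp.foo")
    main.execute("COMMIT")
    main.execute("DETACH DATABASE temp")
    main.close()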
4. Post-Import Maintenance
Analyze Query Plans: Run ANALYZE after data changes to update statistics for the query planner:

    ANALYZE;

This ensures optimal index usage for subsequent read queries.
Monitor WAL File Size: Use PRAGMA wal_checkpoint(TRUNCATE) after imports to force WAL truncation; this checkpoints all frames and then resets the WAL file to zero length, provided no reader is still holding an older snapshot.
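Both maintenance steps can be chained at the end of the import script, roughly as follows (a sketch under the same assumed database path as the earlier examples):

    import sqlite3

    conn = sqlite3.connect('database.db')

    conn.execute("ANALYZE")   # refresh planner statistics after the bulk load

    # Checkpoint everything and shrink the -wal file back to zero length.
    busy, wal_pages, moved = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
    if busy:
        print("checkpoint could not complete; a reader still holds an old snapshot")

    conn.close()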
5. Alternative Approaches
Use SQLite's .mode insert for Bulk Transfers: Convert the data into a series of INSERT statements and load them in a single transaction. Note that .mode insert controls how query results are printed, so the typical pattern is to dump a staging table (for example, the temp.db from the ATTACH step above) as INSERT statements and replay the resulting SQL file:

    sqlite3 temp.db ".mode insert foo" "SELECT * FROM foo;" > inserts.sql
    sqlite3 database.db "BEGIN;" ".read inserts.sql" "COMMIT;"

This approach consolidates inserts but may require splitting large CSVs (or the generated SQL file) to avoid memory exhaustion.
Leverage External Tools: Use csvsql from the csvkit suite to infer a schema from the CSV and insert the rows directly:

    csvsql --db sqlite:///database.db --insert sample.csv
By systematically addressing index management, transaction granularity, and database configuration, bulk import times can be stabilized at near-initial speeds while maintaining read availability.