Resolving Intermittent SQLite Disk I/O Errors in High-Read Environments
Understanding Intermittent SQLite Disk I/O Errors in WAL Mode with Ubuntu and PocketBase
Issue Overview
The core issue involves SQLite databases configured in Write-Ahead Logging (WAL) mode experiencing sporadic Disk I/O errors under high-read workloads, particularly in environments using PocketBase as a REST API wrapper. These errors resolve temporarily after restarting the PocketBase service or the underlying Ubuntu 20.04 VPS but reoccur after days of operation. Key observations include:
- The database passes
PRAGMA integrity_check(no structural corruption). - The system uses standard ext4/XFS filesystems (not NFS).
- File descriptor limits (
ulimit -n) were increased to 4096, reducing error frequency but not eliminating it. - No direct evidence of hardware failures or storage quota exhaustion.
- PocketBase does not implement custom connection pooling or transaction management.
The transient nature of the error and its resolution via process termination suggest a resource contention or stateful I/O handling issue tied to SQLite’s interaction with the operating system, filesystem, or PocketBase’s connection lifecycle. The problem is exacerbated in WAL mode due to its reliance on shared memory (-shm) and write-ahead log (-wal) files, which require stable file handles and consistent synchronization.
Root Causes: File Descriptor Leaks, WAL File Management, and Kernel-Level Contention
-
File Descriptor Exhaustion in High-Concurrency Scenarios
Whileulimit -n 4096increases the per-process limit, system-wide limits (e.g.,/proc/sys/fs/file-max) or cgroup restrictions in virtualized environments (DigitalOcean droplets) may throttle open files. SQLite WAL mode maintains persistent file handles for -shm and -wal, which can accumulate if connections are not closed properly. PocketBase’s REST API may inadvertently leave connections open during high traffic, leading to gradual leakage until the OS denies further I/O operations. -
WAL File Synchronization Failures
WAL mode relies onfsync()andflock()to ensure atomic commits. On Linux, these operations are sensitive to:- Filesystem mount options:
noatime,nodiratime, orsync/asyncmismatches. - Kernel bugs or filesystem driver issues: Ext4 journaling conflicts with WAL’s write ordering.
- Virtualized storage latency: DigitalOcean’s block storage may introduce micro-delays in I/O, causing SQLite’s internal timeout mechanisms to fail.
- Filesystem mount options:
-
Inode or Page Cache Saturation
Heavy read workloads can pressure the Linux kernel’s page cache, especially ifvm.dirty_ratioorvm.dirty_background_ratioare misconfigured. SQLite’s reliance on memory-mapped I/O in WAL mode may exacerbate this, leading to stalled writes and misinterpreted I/O errors. -
SELinux/AppArmor Restrictions
Mandatory access control frameworks might intermittently block SQLite’s attempts to resize -shm/-wal files, particularly after prolonged uptime where file attributes (e.g., SELinux contexts) drift. -
PocketBase Connection Lifecycle Bugs
While PocketBase claims to close connections on shutdown, uncommitted transactions or orphaned locks during abrupt HTTP request cancellations might leave the database in a state requiring manual intervention.
Mitigation Strategies: Debugging, Configuration Tuning, and Alternative Architectures
Step 1: Diagnose File Descriptor and Process Limits
- Monitor open files in real-time:
watch -n 1 "ls /proc/$(pgrep pocketbase)/fd | wc -l"If the count approaches 4096, implement connection pooling or reduce
max_open_connsin PocketBase. - Inspect system-wide file limits:
cat /proc/sys/fs/file-max sysctl fs.file-nrIf
fs.file-maxis low (e.g., 65535), increase it via/etc/sysctl.conf:fs.file-max = 2097152
Step 2: Audit WAL File Handling and Synchronization
- Disable WAL mode temporarily:
PRAGMA journal_mode = DELETE;If errors cease, investigate WAL-specific interactions.
- Force synchronous writes:
PRAGMA synchronous = FULL;This reduces performance but ensures writes complete before returning.
- Verify filesystem mount options:
mount | grep /path/to/databaseEnsure
noatimeandnodiratimeare absent, as they can interfere with SQLite’s timestamp-based locking.
Step 3: Kernel and Virtualization Layer Checks
- Review kernel logs for I/O errors:
dmesg | grep -i 'error' journalctl -k --since "1 hour ago" | grep -i 'I/O'Look for SCSI/ATA errors indicating underlying storage issues.
- Test storage latency:
fio --name=latency-test --ioengine=sync --rw=randwrite --bs=4k --numjobs=1 --size=1G --runtime=60 --time_basedLatency spikes >100ms suggest DigitalOcean storage bottlenecks.
Step 4: SELinux/AppArmor and Security Policy Audits
- Check for SELinux denials:
ausearch -m avc -ts recentTemporary disable SELinux to test:
setenforce 0If errors disappear, create a policy allowing SQLite’s file operations.
Step 5: Implement Connection Pooling and Timeouts
Modify PocketBase’s database initialization to enforce connection limits and idle timeouts (example Go code):
import (
"database/sql"
_ "github.com/mattn/go-sqlite3"
)
func NewDB() (*sql.DB, error) {
db, err := sql.Open("sqlite3", "./data.db?_journal_mode=WAL&_timeout=5000")
if err != nil {
return nil, err
}
db.SetMaxOpenConns(20)
db.SetMaxIdleConns(5)
db.SetConnMaxLifetime(30 * time.Minute)
return db, nil
}
Step 6: Alternative Storage Engines and Failover
If errors persist, consider:
- Migrating to PostgreSQL: PocketBase supports PostgreSQL, which handles high concurrency more robustly.
- Using a networked filesystem with strict consistency: Replace ext4 with ZFS (
zfs set sync=always) to enforce write ordering.
Long-Term Monitoring and Prevention
- Deploy Prometheus/Grafana dashboards tracking:
node_filefd_allocated(open file descriptors).node_disk_io_time_seconds(storage latency).
- Schedule periodic PocketBase restarts via systemd timers to reset connection states preemptively.
- Patch SQLite to 3.45+: Recent versions include fixes for WAL file recovery after crashes.
By systematically addressing file descriptor leaks, WAL synchronization nuances, and virtualization-layer quirks, recurring Disk I/O errors can be eliminated without sacrificing SQLite’s lightweight advantages.