Resolving Intermittent SQLite Disk I/O Errors in High-Read Environments


Understanding Intermittent SQLite Disk I/O Errors in WAL Mode with Ubuntu and PocketBase

Issue Overview

The core issue involves SQLite databases configured in Write-Ahead Logging (WAL) mode experiencing sporadic Disk I/O errors under high-read workloads, particularly in environments using PocketBase as a REST API wrapper. These errors resolve temporarily after restarting the PocketBase service or the underlying Ubuntu 20.04 VPS but reoccur after days of operation. Key observations include:

  • The database passes PRAGMA integrity_check (no structural corruption).
  • The system uses standard ext4/XFS filesystems (not NFS).
  • File descriptor limits (ulimit -n) were increased to 4096, reducing error frequency but not eliminating it.
  • No direct evidence of hardware failures or storage quota exhaustion.
  • PocketBase does not implement custom connection pooling or transaction management.

The transient nature of the error and its resolution via process termination suggest a resource contention or stateful I/O handling issue tied to SQLite’s interaction with the operating system, filesystem, or PocketBase’s connection lifecycle. The problem is exacerbated in WAL mode due to its reliance on shared memory (-shm) and write-ahead log (-wal) files, which require stable file handles and consistent synchronization.


Root Causes: File Descriptor Leaks, WAL File Management, and Kernel-Level Contention

  1. File Descriptor Exhaustion in High-Concurrency Scenarios
    While ulimit -n 4096 increases the per-process limit, system-wide limits (e.g., /proc/sys/fs/file-max) or cgroup restrictions in virtualized environments (DigitalOcean droplets) may throttle open files. SQLite WAL mode maintains persistent file handles for -shm and -wal, which can accumulate if connections are not closed properly. PocketBase’s REST API may inadvertently leave connections open during high traffic, leading to gradual leakage until the OS denies further I/O operations.

  2. WAL File Synchronization Failures
    WAL mode relies on fsync() and flock() to ensure atomic commits. On Linux, these operations are sensitive to:

    • Filesystem mount options: noatime, nodiratime, or sync/async mismatches.
    • Kernel bugs or filesystem driver issues: Ext4 journaling conflicts with WAL’s write ordering.
    • Virtualized storage latency: DigitalOcean’s block storage may introduce micro-delays in I/O, causing SQLite’s internal timeout mechanisms to fail.
  3. Inode or Page Cache Saturation
    Heavy read workloads can pressure the Linux kernel’s page cache, especially if vm.dirty_ratio or vm.dirty_background_ratio are misconfigured. SQLite’s reliance on memory-mapped I/O in WAL mode may exacerbate this, leading to stalled writes and misinterpreted I/O errors.

  4. SELinux/AppArmor Restrictions
    Mandatory access control frameworks might intermittently block SQLite’s attempts to resize -shm/-wal files, particularly after prolonged uptime where file attributes (e.g., SELinux contexts) drift.

  5. PocketBase Connection Lifecycle Bugs
    While PocketBase claims to close connections on shutdown, uncommitted transactions or orphaned locks during abrupt HTTP request cancellations might leave the database in a state requiring manual intervention.


Mitigation Strategies: Debugging, Configuration Tuning, and Alternative Architectures

Step 1: Diagnose File Descriptor and Process Limits

  • Monitor open files in real-time:
    watch -n 1 "ls /proc/$(pgrep pocketbase)/fd | wc -l"  
    

    If the count approaches 4096, implement connection pooling or reduce max_open_conns in PocketBase.

  • Inspect system-wide file limits:
    cat /proc/sys/fs/file-max  
    sysctl fs.file-nr  
    

    If fs.file-max is low (e.g., 65535), increase it via /etc/sysctl.conf:

    fs.file-max = 2097152  
    

Step 2: Audit WAL File Handling and Synchronization

  • Disable WAL mode temporarily:
    PRAGMA journal_mode = DELETE;  
    

    If errors cease, investigate WAL-specific interactions.

  • Force synchronous writes:
    PRAGMA synchronous = FULL;  
    

    This reduces performance but ensures writes complete before returning.

  • Verify filesystem mount options:
    mount | grep /path/to/database  
    

    Ensure noatime and nodiratime are absent, as they can interfere with SQLite’s timestamp-based locking.

Step 3: Kernel and Virtualization Layer Checks

  • Review kernel logs for I/O errors:
    dmesg | grep -i 'error'  
    journalctl -k --since "1 hour ago" | grep -i 'I/O'  
    

    Look for SCSI/ATA errors indicating underlying storage issues.

  • Test storage latency:
    fio --name=latency-test --ioengine=sync --rw=randwrite --bs=4k --numjobs=1 --size=1G --runtime=60 --time_based  
    

    Latency spikes >100ms suggest DigitalOcean storage bottlenecks.

Step 4: SELinux/AppArmor and Security Policy Audits

  • Check for SELinux denials:
    ausearch -m avc -ts recent  
    

    Temporary disable SELinux to test:

    setenforce 0  
    

    If errors disappear, create a policy allowing SQLite’s file operations.

Step 5: Implement Connection Pooling and Timeouts

Modify PocketBase’s database initialization to enforce connection limits and idle timeouts (example Go code):

import (  
  "database/sql"  
  _ "github.com/mattn/go-sqlite3"  
)  

func NewDB() (*sql.DB, error) {  
  db, err := sql.Open("sqlite3", "./data.db?_journal_mode=WAL&_timeout=5000")  
  if err != nil {  
    return nil, err  
  }  
  db.SetMaxOpenConns(20)  
  db.SetMaxIdleConns(5)  
  db.SetConnMaxLifetime(30 * time.Minute)  
  return db, nil  
}  

Step 6: Alternative Storage Engines and Failover

If errors persist, consider:

  • Migrating to PostgreSQL: PocketBase supports PostgreSQL, which handles high concurrency more robustly.
  • Using a networked filesystem with strict consistency: Replace ext4 with ZFS (zfs set sync=always) to enforce write ordering.

Long-Term Monitoring and Prevention

  • Deploy Prometheus/Grafana dashboards tracking:
    • node_filefd_allocated (open file descriptors).
    • node_disk_io_time_seconds (storage latency).
  • Schedule periodic PocketBase restarts via systemd timers to reset connection states preemptively.
  • Patch SQLite to 3.45+: Recent versions include fixes for WAL file recovery after crashes.

By systematically addressing file descriptor leaks, WAL synchronization nuances, and virtualization-layer quirks, recurring Disk I/O errors can be eliminated without sacrificing SQLite’s lightweight advantages.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *