Recovering Corrupted SQLite Databases with Encoding Mismatch and Page Corruption
Database Corruption Symptoms: Incomplete Dumps and Truncated Text Fields in Recovered Data
Issue Overview
A corrupted SQLite database often manifests through specific error patterns during recovery attempts. In this scenario, the primary symptoms include:
-
Partial Data Recovery via
.dumpCommand: The SQLite.dumputility generates an incomplete SQL script containing only data prior to a specific date (e.g., March), omitting newer records. This occurs because the.dumpcommand relies on the database’s logical structure being intact. If corruption affects critical system tables (e.g.,sqlite_master) or key data pages, the utility cannot traverse the entire database, resulting in truncated output. -
Truncated Text Fields in
lost_and_foundTable: Therecovercommand generates alost_and_foundtable containing raw data salvaged from corrupted pages. However, text fields in this table appear abbreviated (e.g., only the first character of a string is preserved). Numeric fields, such as integers or dates, remain intact. This discrepancy arises from a mismatch between the database’s stored text encoding (e.g., UTF-16) and the encoding assumed by the recovery tool (e.g., UTF-8). -
File Access and Integrity Failures: Attempts to copy, compress, or back up the corrupted database file fail with CRC (Cyclic Redundancy Check) errors. Running
PRAGMA integrity_checkreturns "page not found" errors, confirming physical or logical corruption in the database file’s page structure.
The core challenge involves reconciling the database’s encoding scheme with the recovery process and addressing page-level corruption to extract complete data.
Root Causes: Encoding Mismatches, Page Corruption, and Recovery Tool Limitations
Possible Causes
-
Text Encoding Mismatch:
SQLite databases store text data using one of four encodings: UTF-8, UTF-16le, UTF-16be, or UTF-16 (with BOM). The encoding is specified in the database header’s file format version number field (offset 56). If therecovertool misinterprets this setting—for example, parsing UTF-16 text as UTF-8—it will read multi-byte characters as single-byte, truncating strings after the first byte. This explains why text fields in thelost_and_foundtable show only the first character. -
Physical Page Corruption:
The database file is divided into fixed-size pages (default: 4096 bytes). Corruption in the page structure—caused by disk errors, interrupted writes, or software bugs—can render entire pages unreadable. The.dumpcommand fails when it encounters corrupted pages linked to critical system tables or indices. Therecovertool bypasses some structural checks but cannot reconstruct pages with irrecoverable errors, leading to incomplete data inlost_and_found. -
Recovery Tool Limitations:
The SQLiterecovertool is designed to extract raw data from corrupted databases but does not fully reconstruct the original schema or handle encoding mismatches automatically. Thelost_and_foundtable contains raw binary data from salvaged pages, requiring manual intervention to parse fields correctly. -
Header Field Corruption:
If the database header’s text encoding field (offset 56) is corrupted, SQLite and recovery tools may default to an incorrect encoding, exacerbating text truncation issues.
Resolving Encoding Mismatches, Salvaging Corrupted Pages, and Validating Recovered Data
Troubleshooting Steps, Solutions & Fixes
Step 1: Confirm and Correct Text Encoding
Objective: Ensure the recovery tool interprets text fields using the database’s original encoding.
-
Extract the Encoding Flag:
Use a hex editor (e.g., HxD,hexdump) to inspect the database header:hexdump -n 100 -C corrupted.db | head -n 5Examine offset 56 (0x38 in hex). The value indicates the encoding:
1: UTF-82: UTF-16le3: UTF-16be
If the value is corrupted or inconsistent with the database’s intended encoding, manually correct it using the hex editor.
-
Re-Run Recovery with Correct Encoding:
After correcting the header, use therecovertool to regenerate thelost_and_foundtable:sqlite3 recovered.db ".recover" > recovered.sqlText fields should now display correctly if the encoding mismatch was the sole issue.
Step 2: Salvage Data from Corrupted Pages
Objective: Extract maximum data from corrupted pages using a combination of tools.
-
Use
sqlite3with.mode insert:
For databases with partial corruption, attempt to query tables directly:sqlite3 corrupted.db .mode insert .output partial_dump.sql .dump .quitThis may recover data from intact pages not referenced by the corrupted schema.
-
Leverage Third-Party Tools:
Tools like SQLite Database Recovery (Commercial) or Stellar Repair for SQLite can bypass SQLite’s built-in recovery limitations by directly parsing database pages. -
Manual Extraction from
lost_and_found:
If automatic tools fail, manually parse thelost_and_foundtable:- Identify columns by cross-referencing with schema backups.
- Use
CASTor string functions to concatenate truncated text fields (if encoding is correct).
Step 3: Validate and Rebuild the Database
Objective: Ensure recovered data integrity and reconstruct a functional database.
-
Import Recovered Data:
Create a new database and import the recovered SQL script:sqlite3 new.db < recovered.sql -
Run Integrity Checks:
Verify the new database:PRAGMA integrity_check; PRAGMA foreign_key_check; -
Recreate Indices and Triggers:
If the original schema was partially recovered, manually recreate missing indices, triggers, or views using schema backups.
Step 4: Prevent Future Corruption
-
Enable Write-Ahead Logging (WAL):
Reduce corruption risk during crashes:PRAGMA journal_mode = WAL; -
Implement Regular Backups:
UseVACUUM INTOorsqlite3_backupAPI for online backups. -
Monitor Disk Health:
Use tools likesmartctlto detect failing storage hardware.
By systematically addressing encoding mismatches, leveraging specialized recovery tools, and validating the rebuilt database, users can maximize data recovery from corrupted SQLite files.