Recovering Corrupted SQLite Databases with Encoding Mismatch and Page Corruption

Database Corruption Symptoms: Incomplete Dumps and Truncated Text Fields in Recovered Data

Issue Overview

A corrupted SQLite database often manifests through specific error patterns during recovery attempts. In this scenario, the primary symptoms include:

  1. Partial Data Recovery via .dump Command: The SQLite .dump utility generates an incomplete SQL script containing only data prior to a specific date (e.g., March), omitting newer records. This occurs because the .dump command relies on the database’s logical structure being intact. If corruption affects critical system tables (e.g., sqlite_master) or key data pages, the utility cannot traverse the entire database, resulting in truncated output.

  2. Truncated Text Fields in lost_and_found Table: The .recover command generates a lost_and_found table containing rows salvaged from pages it could not attach to any table. However, text fields in this table appear truncated (e.g., only the first character of a string is preserved), while numeric fields, such as integers or dates stored as numbers, remain intact. This discrepancy points to a mismatch between the database’s stored text encoding (e.g., UTF-16) and the encoding assumed during recovery (e.g., UTF-8).

  3. File Access and Integrity Failures: Attempts to copy, compress, or back up the corrupted database file fail with CRC (Cyclic Redundancy Check) errors. Running PRAGMA integrity_check returns "page not found" errors, confirming physical or logical corruption in the database file’s page structure.

The core challenge involves reconciling the database’s encoding scheme with the recovery process and addressing page-level corruption to extract complete data.


Root Causes: Encoding Mismatches, Page Corruption, and Recovery Tool Limitations

Possible Causes

  1. Text Encoding Mismatch:
    SQLite databases store all text in one of three encodings: UTF-8, UTF-16le, or UTF-16be. The encoding is recorded in the database header’s text encoding field, a 4-byte big-endian integer at offset 56. If the recover tool misinterprets this setting, for example by parsing UTF-16 text as UTF-8, strings are cut short: in UTF-16le every ASCII character is followed by a zero byte, and a reader that treats the data as NUL-terminated UTF-8 stops at that byte. This explains why text fields in the lost_and_found table show only the first character.

  2. Physical Page Corruption:
    The database file is divided into fixed-size pages (4096 bytes by default since SQLite 3.12.0). Corruption in the page structure, caused by disk errors, interrupted writes, or software bugs, can render entire pages unreadable. The .dump command fails when it encounters corrupted pages linked to critical system tables or indices. The recover tool bypasses some structural checks but cannot reconstruct pages with irrecoverable errors, leading to incomplete data in lost_and_found.

  3. Recovery Tool Limitations:
    The SQLite recover tool is designed to extract as much data as possible from corrupted databases, but it does not always reconstruct the original schema and does not correct encoding mismatches automatically. Rows it cannot attach to a known table are placed in lost_and_found with generic column names, so mapping them back to the original tables and columns requires manual work.

  4. Header Field Corruption:
    If the database header’s text encoding field (offset 56) is corrupted, SQLite and recovery tools may default to an incorrect encoding, exacerbating text truncation issues.
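
The truncation described in cause 1 can be reproduced without SQLite at all. The following is a minimal sketch, assuming the reader treats the stored bytes as a NUL-terminated UTF-8 string; the helper name read_as_c_string is hypothetical and is not part of any SQLite tool.

```python
def read_as_c_string(raw: bytes) -> str:
    """Decode bytes the way a NUL-terminated UTF-8 reader would (hypothetical helper)."""
    end = raw.find(b"\x00")            # stop at the first NUL byte, if there is one
    if end != -1:
        raw = raw[:end]
    return raw.decode("utf-8", errors="replace")


# Bytes as a UTF-16le database would store the word "March".
stored = "March".encode("utf-16-le")

print(stored)                      # every ASCII character is followed by a zero byte
print(read_as_c_string(stored))    # the wrong-encoding reader keeps only the first character
print(stored.decode("utf-16-le"))  # decoding with the stored encoding restores the text
```

With UTF-16be the zero byte comes first for ASCII characters, so the same reader would return an empty string rather than a single character.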


Resolving Encoding Mismatches, Salvaging Corrupted Pages, and Validating Recovered Data

Troubleshooting Steps, Solutions & Fixes

Step 1: Confirm and Correct Text Encoding

Objective: Ensure the recovery tool interprets text fields using the database’s original encoding.

  1. Extract the Encoding Flag:
    Use a hex editor (e.g., HxD, hexdump) to inspect the database header:

    hexdump -n 100 -C corrupted.db | head -n 5  
    

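    Examine the 4-byte big-endian integer at offset 56 (0x38 in hex). In a healthy file the first three bytes are 00 and the last byte (offset 59) holds the value that indicates the encoding: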

    • 1: UTF-8
    • 2: UTF-16le
    • 3: UTF-16be

    If the value is corrupted or inconsistent with the database’s intended encoding, manually correct it using the hex editor. Always edit a copy of the file and keep the original untouched.

  2. Re-Run Recovery with Correct Encoding:
    After correcting the header, run the recover command against the corrupted file to regenerate the lost_and_found table and write the result to a SQL script:

    sqlite3 corrupted.db ".recover" > recovered.sql
    

    Text fields should now display correctly if the encoding mismatch was the sole issue.
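
The header inspection from item 1 can also be scripted instead of read off a hex dump. The sketch below reads the first 100 bytes of the file and decodes the page size (offset 16) and the text encoding (offset 56). The function name read_header_info is an illustrative assumption, not an existing API.

```python
import struct

ENCODINGS = {1: "UTF-8", 2: "UTF-16le", 3: "UTF-16be"}
MAGIC = b"SQLite format 3\x00"


def read_header_info(path: str) -> dict:
    """Return the page size and text encoding recorded in the 100-byte header."""
    with open(path, "rb") as f:
        header = f.read(100)
    if len(header) < 100 or header[:16] != MAGIC:
        raise ValueError("not a SQLite 3 database, or the header is damaged")

    # Page size: 2-byte big-endian integer at offset 16; the value 1 means 65536.
    (page_size,) = struct.unpack(">H", header[16:18])
    if page_size == 1:
        page_size = 65536

    # Text encoding: 4-byte big-endian integer at offset 56.
    (enc_code,) = struct.unpack(">I", header[56:60])

    return {
        "page_size": page_size,
        "encoding_code": enc_code,
        "encoding": ENCODINGS.get(enc_code, "unknown (possibly corrupted)"),
    }
```

Calling read_header_info("corrupted.db") on a copy of the damaged file gives a quick indication of whether the encoding field still holds one of the three valid values before any hex editing is attempted.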

Step 2: Salvage Data from Corrupted Pages

Objective: Extract maximum data from corrupted pages using a combination of tools.

  1. Use sqlite3 with .mode insert:
    For databases with partial corruption, attempt to query tables directly:

    sqlite3 corrupted.db  
    .mode insert  
    .output partial_dump.sql  
    .dump  
    .quit  
    

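    This recovers data only from tables that are still reachable through an intact schema; pages the schema no longer references are out of its reach and need .recover. Note that .mode insert affects how SELECT results are printed, not .dump, which always emits SQL statements.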

  2. Leverage Third-Party Tools:
    Tools like SQLite Database Recovery (Commercial) or Stellar Repair for SQLite can bypass SQLite’s built-in recovery limitations by directly parsing database pages.

  3. Manual Extraction from lost_and_found:
    If automatic tools fail, manually parse the lost_and_found table:

    • Identify columns by cross-referencing with schema backups.
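    • Use CAST, typeof(), or hex() to inspect and convert field values. Text that was already truncated during extraction cannot be restored this way; it requires re-running the recovery with the correct encoding.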
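
As a starting point for the manual extraction in item 3, orphaned rows can be grouped by the root page they came from and by their field count, since rows from the same original table usually share both. This sketch assumes the lost_and_found layout (rootpgno, pgno, nfield, id, c0, c1, ...); confirm it with PRAGMA table_info(lost_and_found) first. Both function names are hypothetical.

```python
import re
import sqlite3


def summarize_lost_and_found(con: sqlite3.Connection) -> list:
    """Group orphaned rows by original root page and number of fields.

    Assumes the column layout (rootpgno, pgno, nfield, id, c0, c1, ...).
    """
    return con.execute(
        """
        SELECT rootpgno, nfield, COUNT(*) AS row_count
        FROM lost_and_found
        GROUP BY rootpgno, nfield
        ORDER BY row_count DESC
        """
    ).fetchall()


def column_types(con: sqlite3.Connection, rootpgno: int, column: str) -> list:
    """Count the storage classes found in one cN column for one root page."""
    # Column names cannot be bound as parameters, so validate before formatting.
    if not re.fullmatch(r"c[0-9]+", column):
        raise ValueError("expected a column name such as c0 or c1")
    return con.execute(
        f"SELECT typeof({column}), COUNT(*) FROM lost_and_found "
        "WHERE rootpgno = ? GROUP BY 1 ORDER BY 2 DESC",
        (rootpgno,),
    ).fetchall()
```

Comparing each group's field count and per-column storage classes against a schema backup helps decide which original table a group most likely belongs to.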

Step 3: Validate and Rebuild the Database

Objective: Ensure recovered data integrity and reconstruct a functional database.

  1. Import Recovered Data:
    Create a new database and import the recovered SQL script:

    sqlite3 new.db < recovered.sql  
    
  2. Run Integrity Checks:
    Verify the new database:

    PRAGMA integrity_check;  
    PRAGMA foreign_key_check;  
    
  3. Recreate Indices and Triggers:
    If the original schema was partially recovered, manually recreate missing indices, triggers, or views using schema backups.
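
The two checks from item 2 can be wrapped in a small script so they run after every import. This is a minimal sketch using Python's standard sqlite3 module; the function name validate_database and the shape of the returned dictionary are illustrative choices.

```python
import sqlite3


def validate_database(path: str) -> dict:
    """Run integrity_check and foreign_key_check and collect their findings."""
    con = sqlite3.connect(path)
    try:
        # integrity_check returns a single row containing 'ok' when no problems are found.
        integrity = [row[0] for row in con.execute("PRAGMA integrity_check")]
        # foreign_key_check returns one row per violating row, and nothing if all is well.
        fk_violations = con.execute("PRAGMA foreign_key_check").fetchall()
    finally:
        con.close()
    return {
        "integrity_ok": integrity == ["ok"],
        "integrity_messages": integrity,
        "fk_violations": fk_violations,
    }
```

A result with integrity_ok set to True and an empty fk_violations list indicates the rebuilt file is structurally sound; it does not prove that every original row was recovered.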

Step 4: Prevent Future Corruption

  1. Enable Write-Ahead Logging (WAL):
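    WAL mode appends changes to a separate log file and lets readers and writers work concurrently. Like rollback-journal mode, it protects against crashes only when the storage stack honors sync requests: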

    PRAGMA journal_mode = WAL;  
    
  2. Implement Regular Backups:
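    Use VACUUM INTO or the online backup API (sqlite3_backup_init, sqlite3_backup_step, sqlite3_backup_finish) to take consistent backups while the database is in use.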

  3. Monitor Disk Health:
    Use tools like smartctl to detect failing storage hardware.
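
For the backup recommendation in item 2, Python's standard sqlite3 module exposes the online backup API through Connection.backup. The sketch below copies a live database in batches; the function name online_backup and the batch size of 256 pages are arbitrary illustrative choices.

```python
import sqlite3


def online_backup(src_path: str, dest_path: str) -> None:
    """Copy a database to dest_path using SQLite's online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Copy in batches of 256 pages so other connections can keep working.
        src.backup(dest, pages=256)
    finally:
        dest.close()
        src.close()
```

Running PRAGMA integrity_check on the copy afterwards is a cheap way to confirm that the backup itself is usable.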


By systematically addressing encoding mismatches, leveraging specialized recovery tools, and validating the rebuilt database, users can maximize data recovery from corrupted SQLite files.
