Duplicated and Missing Rows Due to Corrupted Indexes When Using ORDER BY
Issue Overview: Corrupted Indexes Causing Inconsistent Query Results with ORDER BY
When executing a SELECT
query with an ORDER BY
clause in SQLite, the presence of corrupted indexes can lead to unexpected behavior such as duplicated rows, missing rows, or inconsistencies between sorted and unsorted query results. This issue arises when the database engine relies on a corrupted index to fulfill the sorting requirement of the query. The corruption may manifest as missing entries in the index, incorrect linkages between index entries and table rows, or mismatches between the index and the underlying table data.
The problem becomes apparent when comparing the output of a simple SELECT * FROM table
query (which may bypass the corrupted index) with a SELECT * FROM table ORDER BY column
query (which may utilize the corrupted index). The former might return the correct number of rows with unique values, while the latter could produce duplicates or omit rows entirely. This discrepancy occurs because SQLite’s query planner chooses different access paths depending on whether sorting is required. When an index is corrupted, the engine may erroneously traverse invalid index entries, skip valid ones, or misalign index keys with table rows.
A critical aspect of this issue is the role of UNIQUE
constraints. Columns with such constraints should theoretically prevent duplicate values, but corruption in the associated indexes (including implicit indexes created for UNIQUE
constraints) can bypass these safeguards. For example, a corrupted index might fail to detect duplicate entries during insertion, allow duplicate entries to persist, or misreport the existence of entries during queries. This undermines the integrity guarantees provided by UNIQUE
constraints and primary keys.
The symptoms described—such as CSV delimiter changes during export—are secondary effects. The delimiter change is likely due to SQLite’s .mode
command in the command-line shell not being explicitly set to csv
for both queries, causing it to default to another mode (e.g., list
, which uses pipes). The core issue remains the database corruption affecting index integrity.
Possible Causes: Index Corruption and Query Planner Behavior
1. Index Corruption Due to Improper Shutdown or Hardware Failure
SQLite databases are generally resilient to corruption, but abrupt interruptions during write operations (e.g., power failures, application crashes) can leave indexes in an inconsistent state. When an index is being updated, a partial write or incomplete transaction commit may result in missing or orphaned index entries. For example, if a row is inserted into a table but the corresponding index entry is not fully written, subsequent queries using that index may skip the row or misalign its position. This explains why a SELECT
without ORDER BY
(which scans the table directly) returns correct results, while a sorted query (which uses the corrupted index) produces discrepancies.
2. Implicit Index Corruption for UNIQUE Constraints and Primary Keys
Columns with UNIQUE
constraints or primary keys automatically create internal indexes (e.g., sqlite_autoindex_*
). Corruption in these indexes can lead to violations of uniqueness that the database engine fails to detect. For instance, if the index entry for a unique path
value is missing, the engine might allow duplicate path
values to be inserted, despite the constraint. Similarly, queries relying on these indexes for sorting or filtering may return incorrect results. The PRAGMA integrity_check;
output in the discussion explicitly identifies missing index entries for path_index
and sqlite_autoindex_files_2
, confirming this cause.
3. Query Planner’s Index Selection Heuristics
SQLite’s query planner decides whether to use an index based on factors like the presence of ORDER BY
, WHERE
clauses, and statistical data. When an ORDER BY
clause matches the sort order of an index, the planner prioritizes that index to avoid a costly SORT
operation. If the chosen index is corrupted, the query returns data based on the index’s flawed structure. This explains why the unsorted query (table scan) works correctly, while the sorted query (index scan) fails. The planner’s reliance on corrupted metadata (e.g., incorrect row counts in sqlite_stat1
) can exacerbate the issue by favoring corrupted indexes unnecessarily.
Troubleshooting Steps, Solutions & Fixes: Diagnosing and Resolving Index Corruption
Step 1: Confirm Corruption with PRAGMA integrity_check
Execute PRAGMA integrity_check;
to identify inconsistencies between tables and indexes. This command performs a thorough verification of the entire database, checking for:
- Missing or extra index entries
- Incorrect rowid mappings
- Invalid data types or formats
- Orphaned rows in tables without corresponding index entries
Example output indicating index corruption:
row 10828 missing from index path_index
row 10828 missing from index sqlite_autoindex_files_2
wrong # of entries in index path_index
If errors are found, proceed to recovery steps.
Step 2: Recover Data Using the .recover Command
The SQLite shell’s .recover
command reconstructs the database by extracting data from all accessible pages, bypassing corrupted structures. This is safer than manual repairs but may not preserve all constraints or triggers.
Procedure:
- Backup the original database:
cp original.db original.backup.db
- Recover data into a new database:
sqlite3 original.db ".recover" | sqlite3 recovered.db
- Verify the new database:
sqlite3 recovered.db "PRAGMA integrity_check;"
Caveats:
- The
.recover
command may fail to reconstruct AUTOINCREMENT sequences or virtual tables. - Encoding issues (e.g., broken special characters) can occur if the original database used non-UTF8 encodings. Use the
-ascii
,-utf16
, or-utf8
flags with.recover
to specify encoding.
Step 3: Rebuild Indexes with REINDEX
After recovery, rebuild all indexes to ensure consistency:
REINDEX;
This command drops and recreates every index in the database, including autoindexes for UNIQUE
constraints. For large databases, this may take considerable time.
Step 4: Prevent Future Corruption
Enable Write-Ahead Logging (WAL):
PRAGMA journal_mode=WAL;
WAL reduces the risk of corruption by separating writes into a log file, minimizing direct modification of the main database.
Avoid Hard Shutdowns During Writes: Implement graceful shutdown handlers in applications to ensure transactions complete before exit.
Regularly Check Integrity: Schedule periodic
PRAGMA integrity_check;
runs, especially after unexpected shutdowns or errors.Use Application-Level Safeguards: Validate data uniqueness in application code before insertion, even when
UNIQUE
constraints exist.
Step 5: Handle Encoding Issues Post-Recovery
If .recover
alters text encoding, use the sqlite3
shell’s .dump
command to export SQL statements, then re-import with the correct encoding:
sqlite3 corrupted.db ".dump" | iconv -f ISO-8859-1 -t UTF-8 | sqlite3 fixed.db
Replace ISO-8859-1
and UTF-8
with the appropriate source and target encodings.
Step 6: Verify Query Planner Behavior
After recovery, use EXPLAIN QUERY PLAN
to analyze how the engine executes sorted queries:
EXPLAIN QUERY PLAN SELECT * FROM files ORDER BY path ASC;
Ensure the output references the correct index (e.g., USING INDEX path_index
). If the planner avoids a corrupted index, the repair was successful.
By systematically addressing index corruption and understanding SQLite’s query planning mechanics, users can resolve inconsistencies arising from sorted queries and restore database integrity.