Incomplete SQLite3 Disk Format Visualization in Kaitai Struct Specifications


Structural Gaps in SQLite3 Kaitai Specification Implementation

Issue Overview: Limitations in SQLite3 Binary Format Representation
The core challenge revolves around accurately modeling SQLite3’s binary disk format using Kaitai Struct, a declarative language for describing binary data structures. The initial implementation of the SQLite3 Kaitai specification lacked critical components required to fully parse and visualize database files, including:

  1. Freelist Trunk/Leaf Page Handling: SQLite3 uses a freelist to track unused pages. The original specification did not account for trunk pages (which store pointers to freelist leaf pages) or leaf pages (which contain actual unused page numbers). This omission prevented accurate reconstruction of database fragmentation patterns.

  2. B-Tree Page Type Discrimination: SQLite3 stores both table and index data as B-tree structures. The initial implementation failed to distinguish between table interior, table leaf, index interior, and index leaf page types, leading to incorrect interpretation of cell payload formats.

  3. Header Field Omissions: Key header fields like version_valid_for, sqlite_version_number, and reserved bytes were either partially implemented or missing entirely. These fields are critical for validating database integrity during version upgrades or recovery scenarios.

  4. Variable-Length Integer (Varint) Parsing: SQLite3 employs compact 1-9 byte varints for storing row IDs and payload sizes. The original specification used fixed-size integers, causing misalignment in cell content parsing.

  5. Cell Content Hierarchy: The relationship between cell pointers, payload headers, and serial types was not fully modeled. This made it impossible to recursively decode column values from payloads, especially when overflow pages were involved.

The consequences of these gaps manifest as incomplete visualizations in Kaitai’s CLI tools – for example, inability to traverse free pages, truncated payload displays, and failure to resolve nested B-tree structures. These limitations reduce the utility of Kaitai Struct for forensic analysis or educational deep dives into SQLite3 internals.


Root Causes of Specification Incompleteness

1. Evolving Format Complexity
SQLite3’s file format has accumulated backward-compatible changes over 20+ years. Features like auto-vacuum (post-3.3.14), write-ahead logging (WAL, 3.7.0), or pointer map pages (3.6.18) introduced new header flags and page types. Kaitai specifications targeting older SQLite versions naturally omit newer fields unless explicitly version-gated.

2. Documentation Ambiguity
While SQLite’s file format documentation is thorough, certain edge cases require reading the actual C source code. For example:

  • The reserved header bytes (offset 72-91) are reserved for future expansion but must be zeroed in current versions.
  • schema_format values ≥4 enable features like descending indexes and true boolean storage.
  • Free list trunk pages use a 4-byte big-endian integer at offset 0 for the next trunk pointer, conflicting with the little-endian default in many specs.

3. Variable-Length Payloads
Cell payloads in leaf pages use a serialized header that encodes column types and lengths as a varint stream. Parsing this requires:

  • Iterating through varints to determine column count
  • Differentiating between serial types (e.g., 0 = NULL, 1 = 1-byte int, 12 = BLOB with 12-12 bytes)
  • Handling overflow pages when payload exceeds max_embedded_payload (calculated from page_size and leaf_payload_frac)

4. Page Type-Specific Logic
Each B-tree page type has unique cell formats:

  • Table Interior: 4-byte child page pointer + varint row ID
  • Index Leaf: Key payload without row ID
  • Table Leaf: Payload with row ID + optional overflow

Failing to branch on page_type (byte 0 offset in page) leads to invalid pointer dereferencing.


Comprehensive Specification Remediation Strategy

Step 1: Header Field Augmentation
Update the Kaitai Struct definition (sqlite3.ksy) to include all 100 bytes of the database header:

seq:
  - id: magic
    contents: "SQLite format 3\000"
  - id: len_page_mod
    type: u2
    doc: Page size (1 << (len_page_mod - 1))  
  - id: write_version
    type: u1
    enum: versions
  # ... existing fields ...
  - id: version_valid_for
    type: u4
    doc: Header version when sqlite_version_number was set
  - id: sqlite_version_number
    type: u4
    doc: SQLITE_VERSION_NUMBER from header writer

Step 2: Page Type Discrimination
Implement a page_type enum and conditional parsing:

enums:
  page_type:
    0x02: table_interior
    0x05: table_leaf
    0x0a: index_interior
    0x0d: index_leaf

types:
  page:
    seq:
      - id: page_type
        type: u1
        enum: page_type
      - id: first_freeblock
        type: u2
        if: page_type in ['table_leaf', 'index_leaf']
      - id: num_cells
        type: u2
      # ... cell pointer array ...
      - id: cells
        type: cell
        repeat: expr
        repeat-expr: num_cells

Step 3: Freelist Chain Resolution
Model trunk pages with next pointers and leaf page arrays:

types:
  freelist_trunk:
    seq:
      - id: next_trunk_page
        type: u4be  # Big-endian per SQLite docs
      - id: num_leaf_ptrs
        type: u4be
      - id: leaf_page_numbers
        type: u4be
        repeat: expr
        repeat-expr: num_leaf_ptrs

Step 4: Varint Handling
Define a custom varint type with Kaitai’s process attribute:

types:
  varint:
    seq:
      - id: bytes
        type: u1
        repeat: until
        repeat-until: _io.pos >= _io.size || (byte & 0x80) == 0
    instances:
      value:
        value: >
          var val = 0;
          for (var i = 0; i < bytes.length; i++) {
            val = (val << 7) | (bytes[i] & 0x7f);
          }
          val

Step 5: Cell Payload Recursion
Leverage Kaitai’s parameterized types to pass page_type and schema_format into cells:

types:
  cell:
    params:
      - id: page_type
        type: u1
      - id: schema_format
        type: u4
    seq:
      - id: row_id
        type: varint
        if: page_type == table_leaf
      - id: payload_header
        type: payload_header(schema_format)
    types:
      payload_header:
        seq:
          - id: len_header
            type: varint
          - id: column_serials
            type: varint
            repeat: expr
            repeat-expr: _parent._root.header.schema_format >= 4 ? (len_header / 2) : (len_header / 1)

Validation Protocol

  1. Hex Dump Alignment: Compare Kaitai’s visualization against hexdump -C output to verify byte offsets.
  2. Reference SQLite CLI: Use .dbinfo and .header on in sqlite3 shell to cross-check num_freelist_pages or schema_cookie.
  3. Fuzzing: Generate test databases with varying page sizes, auto-vacuum modes, and row formats.
  4. Version Testing: Validate against SQLite 3.0 (legacy), 3.30 (WAL), and latest 3.45.

Example Workflow

# Generate test DB
sqlite3 test.db "CREATE TABLE t1(a INTEGER PRIMARY KEY, b TEXT);"
sqlite3 test.db "VACUUM;"  # Force freelist population

# Visualize with Kaitai
ksv -f sqlite3.ksy test.db

Anticipated Visualization Output

[-] pages (4 entries)
  [-] 0 (table_leaf)
    [.] num_cells = 1
    [-] cells
      [.] row_id = 1
      [-] payload
        [.] column_serials = [1, 13]
        [-] column_contents
          [0] 1 (integer)
          [1] "sample text"
  [-] 1 (freelist_trunk)
    [.] next_trunk_page = 0
    [.] leaf_page_numbers = [2,3]

By methodically addressing each gap through type refinements, conditional parsing, and validation against ground truth data, the enhanced Kaitai specification becomes a robust tool for SQLite3 forensics and education.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *