FTS5 Averages Record Mismatch: id=1 vs. id=10 Documentation Error

Understanding the FTS5 Averages Record Storage Discrepancy

The Full-Text Search version 5 (FTS5) extension in SQLite provides advanced text search capabilities through virtual tables. A critical component of its architecture is the internal %_data table, which stores metadata and structural information required for efficient query execution. The documentation describes a specific record within this table (id=10) as containing statistical averages and token counts across columns. However, empirical testing reveals this data is stored in the record with id=1, while id=10 contains structural metadata. This discrepancy between documented behavior and actual implementation creates confusion for developers relying on these details for low-level optimizations, debugging, or custom extensions.

Technical Breakdown of FTS5 Storage Architecture

FTS5 organizes data using multiple shadow tables, with the %_data table serving as a central repository for:

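  1. Averages and statistics (documented at id=10, actually stored at id=1)
  2. The index structure record (stored at id=10)
  3. Index segments (stored at much larger ID values)
  4. A configuration cookie (carried in the header of the structure record; the configuration options themselves live in the separate %_config shadow table)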

Each record in %_data uses a binary format with packed varints – variable-length integers that optimize storage space. The misinterpretation of which ID stores specific data types directly impacts developers attempting to:

  • Analyze index health
  • Rebuild corrupted FTS5 tables manually
  • Implement custom ranking algorithms
  • Monitor token distribution across columns
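
The varint packing mentioned above can be sketched in a few lines. This is a minimal decoder, assuming FTS5 uses the same varint format as the SQLite core (7 payload bits per byte, high bit set on every byte except the last, and a ninth byte that contributes all 8 bits); the function names are illustrative and not part of any SQLite API.

```python
def read_varint(buf, pos=0):
    """Decode one SQLite-style varint from buf at pos; return (value, next_pos)."""
    value = 0
    for i in range(8):
        byte = buf[pos + i]
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:  # high bit clear: this was the last byte of the varint
            return value, pos + i + 1
    # A ninth byte, if reached, contributes all eight of its bits.
    value = (value << 8) | buf[pos + 8]
    return value, pos + 9


def read_all_varints(buf):
    """Decode a blob that consists only of back-to-back varints."""
    values, pos = [], 0
    while pos < len(buf):
        value, pos = read_varint(buf, pos)
        values.append(value)
    return values


print(read_all_varints(bytes.fromhex("0102")))    # [1, 2]
print(read_all_varints(bytes.fromhex("81007f")))  # [128, 127]
```

Values below 128 occupy a single byte, which is why small test tables produce blobs that look like runs of 01 and 02.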

Root Causes of the Documentation-Implementation Mismatch

1. Version-Specific Documentation Drift

The FTS5 module has undergone significant revisions since its introduction. While the core functionality remains stable, internal storage details like record ID assignments may change between minor releases without immediate documentation updates. The confusion arises when:

  • Documentation describes legacy behavior from earlier FTS versions
  • Implementation changes are committed without corresponding doc updates
  • Edge-case handling differs between SQLite core and extensions

2. Misinterpretation of Structure vs. Statistics

The %_data table contains multiple special-purpose records:

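  • id=1: Averages record holding the total row count and per-column token totals (the documentation placed this at id=10)
  • id=10: Structure record holding a configuration cookie followed by the list of index levels and segments
  • Higher IDs: Segment pages of the inverted index
  • Configuration options: kept in the separate %_config shadow table, not in %_data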
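
The mapping from record ID to record type can be sketched as a small classifier. The two fixed IDs come from the discussion above; the bit widths used to split the larger segment IDs are an assumption taken from reading the FTS5 source (fts5_index.c) and may not hold for every release, and `classify_rowid` is a hypothetical helper name.

```python
AVERAGES_ROWID = 1
STRUCTURE_ROWID = 10

# Assumed bit layout of segment rowids: segid | dlidx flag | height | page number.
PAGE_BITS, HEIGHT_BITS, DLI_BITS = 31, 5, 1


def classify_rowid(rowid):
    """Describe what a %_data row with this id is expected to hold."""
    if rowid == AVERAGES_ROWID:
        return {"kind": "averages"}
    if rowid == STRUCTURE_ROWID:
        return {"kind": "structure"}
    return {
        "kind": "segment",
        "segid": rowid >> (PAGE_BITS + HEIGHT_BITS + DLI_BITS),
        "dlidx": (rowid >> (PAGE_BITS + HEIGHT_BITS)) & 1,
        "height": (rowid >> PAGE_BITS) & ((1 << HEIGHT_BITS) - 1),
        "pgno": rowid & ((1 << PAGE_BITS) - 1),
    }


print(classify_rowid(1))
print(classify_rowid(10))
print(classify_rowid((1 << 37) + 1))  # first page of segment 1 under the assumed layout
```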

The original documentation conflated the purpose of id=10 (the index structure record) with id=1 (statistical data), likely due to:

  • Copy-paste errors from FTS4 documentation
  • Mislabeled internal variable names in source code comments
  • Overlapping storage formats between structure and statistics records

3. Varint Encoding Complexity

Both id=1 and id=10 records use packed varint encoding, making visual inspection of hex dumps challenging:

-- Example from original report:
id=1 | hex(block)=000000000101010001010101
id=10| hex(block)=0000000E063068656C6C6F01020204

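Without a decoder these blobs are easy to misread. The first sample starts with four zero bytes followed by single-byte varints, which matches the structure-record layout (a 32-bit configuration cookie, then varints), and the second embeds the ASCII term "hello", which is typical of a segment leaf page. An averages record is a bare list of varints with no header, so the id labels in this sample appear to have been shifted during transcription and should not be taken at face value.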

Resolving the FTS5 Storage Misunderstanding

Step 1: Validate FTS5 Table Structure

Create a minimal test case to verify storage behavior:

CREATE VIRTUAL TABLE test_fts USING fts5(content);
INSERT INTO test_fts VALUES ('sample text');
SELECT id, hex(block) FROM test_fts_data;

Expected output shows:

  • id=1 with 2 varints (row count, then one token total per column; the expected bytes here are 01 02 for one row containing two tokens)
  • id=10 with the index structure record
  • Much larger IDs containing index segment pages
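
The same check can be scripted. The following is a minimal sketch using Python's standard sqlite3 module; it assumes the linked SQLite library was compiled with FTS5 and returns None otherwise. `dump_fts5_data` is a hypothetical helper name.

```python
import sqlite3


def dump_fts5_data():
    """Return [(id, hex)] for a one-row FTS5 table, or None if FTS5 is missing."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE test_fts USING fts5(content)")
    except sqlite3.OperationalError:
        conn.close()
        return None
    conn.execute("INSERT INTO test_fts VALUES ('sample text')")
    conn.commit()  # committing flushes pending index data into the shadow tables
    rows = conn.execute(
        "SELECT id, block FROM test_fts_data ORDER BY id"
    ).fetchall()
    conn.close()
    return [(rowid, bytes(block).hex()) for rowid, block in rows]


rows = dump_fts5_data()
if rows is None:
    print("FTS5 is not available in this SQLite build")
else:
    for rowid, hexblob in rows:
        print(rowid, hexblob)
```

Reading the shadow table only after the commit matters: FTS5 buffers new index entries in memory, so an uncommitted insert may not yet be visible in test_fts_data.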

Step 2: Decode the id=1 Averages Record

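Using SQLite’s fts5_decode() function (a debugging aid that takes the record id and the blob; in current sources it is only compiled in with -DSQLITE_FTS5_DEBUG or in test builds, so it is not available in standard release builds):

SELECT fts5_decode(id, block) FROM test_fts_data WHERE id=1;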

The record is a packed list of varints with no header. For the one-row test table above, the expected raw bytes are:

0x01 0x02 → {nRow=1, tokens=[2]}

Breakdown:

  • First varint: Total rows (1)
  • Subsequent varints: Total token count per column (2 tokens, "sample" and "text", in the first and only column)
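
When fts5_decode() is unavailable, the averages record can be decoded by hand. This sketch assumes the layout described in the FTS5 source comments: a bare list of varints with no header, the first being the row count and the rest one token total per column. `decode_averages` is an illustrative name, not an SQLite function.

```python
def read_varint(buf, pos):
    """Decode one SQLite-style varint; return (value, next_pos)."""
    value = 0
    for i in range(8):
        byte = buf[pos + i]
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:
            return value, pos + i + 1
    return (value << 8) | buf[pos + 8], pos + 9


def decode_averages(blob):
    """Return (n_rows, [total tokens per column]) from an id=1 blob."""
    values, pos = [], 0
    while pos < len(blob):
        value, pos = read_varint(blob, pos)
        values.append(value)
    if not values:  # an empty blob is treated as an empty index
        return 0, []
    return values[0], values[1:]


print(decode_averages(bytes.fromhex("0102")))  # (1, [2])
```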

Step 3: Analyze id=10 Structure Record

Decode the structure record:

SELECT fts5_decode(id, block) FROM test_fts_data WHERE id=10;

Output contains:

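  • Configuration cookie (4 bytes)
  • Number of levels, total number of segments, and the write counter
  • For each level: the number of segments being merged and the number of segments in the level
  • For each segment: segment id, first leaf page number, and last leaf page number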
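
A structure record can also be decoded without fts5_decode(). The sketch below assumes the basic layout given in the FTS5 source comments: a 4-byte big-endian configuration cookie followed by varints for the level count, segment count, and write counter, then per-level and per-segment varints. Newer SQLite releases may write an extended variant with extra per-segment fields, which this sketch does not handle. `decode_structure` is a hypothetical name.

```python
def read_varint(buf, pos):
    """Decode one SQLite-style varint; return (value, next_pos)."""
    value = 0
    for i in range(8):
        byte = buf[pos + i]
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:
            return value, pos + i + 1
    return (value << 8) | buf[pos + 8], pos + 9


def decode_structure(blob):
    """Decode a basic-layout FTS5 structure record into a dict."""
    cookie = int.from_bytes(blob[:4], "big")
    pos = 4
    n_level, pos = read_varint(blob, pos)
    n_segment, pos = read_varint(blob, pos)
    write_counter, pos = read_varint(blob, pos)
    levels = []
    for _ in range(n_level):
        n_merge, pos = read_varint(blob, pos)
        n_seg, pos = read_varint(blob, pos)
        segments = []
        for _ in range(n_seg):
            segid, pos = read_varint(blob, pos)
            first_page, pos = read_varint(blob, pos)
            last_page, pos = read_varint(blob, pos)
            segments.append((segid, first_page, last_page))
        levels.append({"n_merge": n_merge, "segments": segments})
    return {
        "cookie": cookie,
        "n_segment": n_segment,
        "write_counter": write_counter,
        "levels": levels,
    }


print(decode_structure(bytes.fromhex("000000000101010001010101")))
```

Applied to the 12-byte sample blob quoted earlier in this article, the decoder consumes every byte and reports one level holding one segment that spans leaf pages 1 to 1.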

Step 4: Update Development Practices

  1. Query id=1 for token statistics instead of id=10
  2. Avoid hardcoding IDs – Use public interfaces where possible, such as an fts5vocab table for per-column token totals:
    CREATE VIRTUAL TABLE test_fts_vocab USING fts5vocab(test_fts, col);
    SELECT col, SUM(cnt) AS total_tokens FROM test_fts_vocab GROUP BY col;
    
  3. Monitor documentation updates – Subscribe to SQLite changelogs
  4. Implement version checks in critical code, and compare the result against known behavior changes in FTS5:
    SELECT sqlite_version() AS version;
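
One way to avoid depending on record IDs at all is to read token totals through the public fts5vocab interface. The sketch below assumes an FTS5-enabled build (it returns None otherwise) and uses hypothetical table and function names; it also prints the library version so results can be tied to a specific SQLite release.

```python
import sqlite3


def column_token_totals():
    """Return {column: total_tokens} for a demo table, or None without FTS5."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE test_fts USING fts5(content)")
        # A 'col' type fts5vocab table has one row per (term, column) pair.
        conn.execute(
            "CREATE VIRTUAL TABLE test_fts_vocab USING fts5vocab(test_fts, col)"
        )
    except sqlite3.OperationalError:
        conn.close()
        return None
    conn.execute("INSERT INTO test_fts VALUES ('sample text')")
    conn.commit()
    rows = conn.execute(
        "SELECT col, SUM(cnt) FROM test_fts_vocab GROUP BY col"
    ).fetchall()
    conn.close()
    return dict(rows)


print("SQLite library version:", sqlite3.sqlite_version)
print(column_token_totals())
```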

Step 5: Mitigation Strategies for Existing Systems

For applications relying on the incorrect documentation:

  1. Hotfix queries accessing id=10 to use id=1
  2. Add validation layers that compare expected vs. actual ID usage
  3. Validate the decoded record (one row count plus one total per column) instead of silently falling back to id=10, which holds the structure record
  4. Audit custom C extensions using direct %_data access
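
A validation layer of the kind listed above can be as small as a shape check on the decoded blob. This sketch assumes the averages record holds exactly one row count plus one total per column, and that an empty blob denotes an index that has never had totals written; both the function name and the empty-blob handling are assumptions.

```python
def read_all_varints(buf):
    """Decode a blob consisting only of SQLite-style varints."""
    values, pos = [], 0
    while pos < len(buf):
        value = 0
        for i in range(9):
            byte = buf[pos + i]
            if i == 8:  # ninth byte carries all eight bits
                value = (value << 8) | byte
                break
            value = (value << 7) | (byte & 0x7F)
            if byte < 0x80:
                break
        pos += i + 1
        values.append(value)
    return values


def validate_averages(blob, n_columns):
    """Return (n_rows, totals) or raise ValueError if the shape is wrong."""
    if len(blob) == 0:
        return 0, [0] * n_columns
    values = read_all_varints(blob)
    if len(values) != 1 + n_columns:
        raise ValueError(
            f"expected {1 + n_columns} varints, found {len(values)}; "
            "this blob is probably not an averages record"
        )
    return values[0], values[1:]


print(validate_averages(bytes.fromhex("0102"), n_columns=1))  # (1, [2])
```

A structure record passed in by mistake fails the count check for typical schemas, which turns a silent misread into an explicit error.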

Step 6: Advanced Binary Format Analysis

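Manually decode the first sample blob shown earlier (the one labeled id=1):

Sample data: 000000000101010001010101
Decoding steps:
1. Read the 4-byte header: 00000000 (the configuration cookie)
2. Read the remaining bytes as varints:
   - 0x01 → 1 (number of levels)
   - 0x01 → 1 (total segments)
   - 0x01 → 1 (write counter)
   - 0x00 → 0 (segments being merged in level 0)
   - 0x01 → 1 (segments in level 0)
   - 0x01 0x01 0x01 → segment id 1, first leaf page 1, last leaf page 1

Note: A 4-byte header followed by this varint sequence is the structure-record layout, so despite its label this sample corresponds to id=10. The averages record at id=1 has no header and no padding: it is the row count followed by one token total per column (for example 0101 for one row with one token in a single column).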

Step 7: Custom Token Counting Verification

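Cross-validate token counts with a rough whitespace-based estimate (it assumes tokens are separated by single spaces and counts an empty string as one token, so it only approximates the tokenizer):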

-- For schema: fts5(a, b)
SELECT 
  (SELECT COUNT(*) FROM tbl) AS total_rows,
  SUM(LENGTH(a) - LENGTH(REPLACE(a, ' ', '')) + 1) AS a_tokens,
  SUM(LENGTH(b) - LENGTH(REPLACE(b, ' ', '')) + 1) AS b_tokens
FROM tbl;

Compare the results with the decoded varints from id=1, which hold totals across all rows rather than per-row averages.
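
The expected values can also be computed outside SQL. The sketch below approximates the default tokenizer with a regular expression that splits on anything other than letters and digits; this is an assumption for plain text and will diverge from FTS5 for custom tokenizers or tokenizer options. The function names are illustrative.

```python
import re

# Runs of letters/digits; underscore and punctuation act as separators.
TOKEN_RE = re.compile(r"[^\W_]+")


def count_tokens(text):
    """Approximate the number of tokens FTS5 would index for text."""
    return len(TOKEN_RE.findall(text or ""))


def expected_averages(rows):
    """rows: list of tuples, one value per column. Return (n_rows, totals)."""
    if not rows:
        return 0, []
    totals = [0] * len(rows[0])
    for row in rows:
        for i, value in enumerate(row):
            totals[i] += count_tokens(value)
    return len(rows), totals


sample = [("hello world", "a b c"), ("x", "")]
print(expected_averages(sample))  # (2, [3, 3])
```

Unlike the SQL estimate, this version counts an empty string as zero tokens, which matches what a tokenizer would emit.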

Step 8: Handling Unindexed Columns

When using UNINDEXED columns:

CREATE VIRTUAL TABLE tbl USING fts5(a, b UNINDEXED);

The id=1 record will contain:

  • Total row count
  • Token count for indexed column ‘a’
  • Zero for unindexed column ‘b’
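
This layout can be checked empirically. The sketch below builds the two-column table in memory and decodes its id=1 record; it assumes an FTS5-enabled build (returning None otherwise) and the header-less varint layout for the averages record. `unindexed_averages` is a hypothetical helper.

```python
import sqlite3


def read_all_varints(buf):
    """Decode a blob consisting only of SQLite-style varints."""
    values, pos = [], 0
    while pos < len(buf):
        value = 0
        for i in range(9):
            byte = buf[pos + i]
            if i == 8:  # ninth byte carries all eight bits
                value = (value << 8) | byte
                break
            value = (value << 7) | (byte & 0x7F)
            if byte < 0x80:
                break
        pos += i + 1
        values.append(value)
    return values


def unindexed_averages():
    """Return (n_rows, per-column totals) for fts5(a, b UNINDEXED), or None."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE tbl USING fts5(a, b UNINDEXED)")
    except sqlite3.OperationalError:
        conn.close()
        return None
    conn.execute("INSERT INTO tbl VALUES ('x y', 'stored but not tokenized')")
    conn.commit()
    (block,) = conn.execute("SELECT block FROM tbl_data WHERE id=1").fetchone()
    conn.close()
    values = read_all_varints(bytes(block))
    return values[0], values[1:]


print(unindexed_averages())
```

If the description above holds, the second total stays at zero no matter how much text is stored in column b.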

Step 9: Version-Specific Workarounds

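The mismatch described here is a documentation error rather than a change in the on-disk format, so branching on the SQLite version should not be necessary:

  1. Read the averages record from id=1
  2. On older or unfamiliar builds, confirm the layout with the Step 1 test case before relying on it
  3. Do not fall back to id=10: it holds the structure record, which must not be parsed as averages
    SELECT block AS averages_data FROM tbl_data WHERE id=1;
    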

Step 10: Long-Term Maintenance Strategy

  1. Automated schema validation:
    def verify_fts5_structure(conn, table):
        # conn is an open sqlite3.Connection; execute() returns a cursor
        row = conn.execute(f'SELECT id FROM "{table}_data" WHERE id=1').fetchone()
        assert row is not None, "Missing averages record"
    
  2. Integration testing with known token counts
  3. Documentation monitoring through SQLite RSS feeds
  4. Community engagement via SQLite forum participation

In summary, the FTS5 averages record lives at id=1 of the %_data table and id=10 holds the index structure. Both can be confirmed with a minimal test table and a varint decoder before any low-level code depends on them.
