FTS5 Averages Record Mismatch: id=1 vs. id=10 Documentation Error

Understanding the FTS5 Averages Record Storage Discrepancy

The Full-Text Search version 5 (FTS5) extension in SQLite provides advanced text search capabilities through virtual tables. A critical component of its architecture is the internal %_data table, which stores metadata and structural information required for efficient query execution. The documentation describes a specific record within this table (id=10) as containing statistical averages and token counts across columns. However, empirical testing reveals this data is stored in the record with id=1, while id=10 contains structural metadata. This discrepancy between documented behavior and actual implementation creates confusion for developers relying on these details for low-level optimizations, debugging, or custom extensions.

Technical Breakdown of FTS5 Storage Architecture

FTS5 organizes data using multiple shadow tables, with the %_data table serving as a central repository for:

Averages and statistics (incorrectly documented at id=10)
Structure configuration (correctly identified at id=10)
Index segments (stored at higher ID values)
Special configuration records

Each record in %_data uses a binary format with packed varints – variable-length integers that optimize storage space. The misinterpretation of which ID stores specific data types directly impacts developers attempting to:

Analyze index health
Rebuild corrupted FTS5 tables manually
Implement custom ranking algorithms
Monitor token distribution across columns

Root Causes of the Documentation-Implementation Mismatch

1. Version-Specific Documentation Drift

The FTS5 module has undergone significant revisions since its introduction. While the core functionality remains stable, internal storage details like record ID assignments may change between minor releases without immediate documentation updates. The confusion arises when:

Documentation describes legacy behavior from earlier FTS versions
Implementation changes are committed without corresponding doc updates
Edge-case handling differs between SQLite core and extensions

2. Misinterpretation of Structure vs. Statistics

The %_data table contains multiple special-purpose records:

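id=1: Actual averages and token counts (documentation error)
id=10: Structure record describing the levels and segments of the index
Configuration options: Stored in the separate %_config shadow table rather than in %_data
Higher IDs: Segment pages (term/doclist leaves and doclist-index pages) for the inverted index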

The original documentation conflated the purpose of id=10 (structure versioning) with id=1 (statistical data), likely due to:

Copy-paste errors from FTS4 documentation
Mislabeled internal variable names in source code comments
Overlapping storage formats between structure and statistics records

3. Varint Encoding Complexity

Both id=1 and id=10 records use packed varint encoding, making visual inspection of hex dumps challenging:

-- Example from original report:
id=1 | hex(block)=000000000101010001010101
id=10| hex(block)=0000000E063068656C6C6F01020204

The similar appearance of these binary blobs (both starting with zero-packed headers) increases the likelihood of misinterpretation without proper decoding tools.
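A small decoder makes such dumps easier to read. The sketch below assumes FTS5 records use the standard SQLite varint encoding (big-endian, seven payload bits per byte, with a ninth byte that contributes all eight bits). The function names are illustrative and not part of SQLite.

```python
from typing import List, Tuple


def read_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one SQLite-style varint at pos; return (value, next position)."""
    value = 0
    for i in range(8):
        byte = buf[pos + i]  # raises IndexError if the buffer is truncated
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:  # high bit clear: this is the last byte
            return value, pos + i + 1
    # Ninth byte contributes all eight bits.
    return (value << 8) | buf[pos + 8], pos + 9


def read_all_varints(buf: bytes, pos: int = 0) -> List[int]:
    """Decode consecutive varints from pos to the end of the buffer."""
    values = []
    while pos < len(buf):
        value, pos = read_varint(buf, pos)
        values.append(value)
    return values


if __name__ == "__main__":
    sample = bytes.fromhex("000000000101010001010101")
    # Bytes after the 4-byte prefix, read as varints.
    print(read_all_varints(sample[4:]))  # [1, 1, 1, 0, 1, 1, 1, 1]
```

Running the decoder over both blobs, rather than eyeballing the hex, shows quickly how many values each record really carries.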

Resolving the FTS5 Storage Misunderstanding

Step 1: Validate FTS5 Table Structure

Create a minimal test case to verify storage behavior:

CREATE VIRTUAL TABLE test_fts USING fts5(content);
INSERT INTO test_fts VALUES ('sample text');
SELECT id, hex(block) FROM test_fts_data;

Expected output shows:

id=1 holding the averages record: the row count followed by one token total per column (two varints for this single-column table)
id=10 holding the structure record
Higher IDs containing index segments
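The same check can be scripted. This is a minimal sketch using Python's standard sqlite3 module; list_data_records is a hypothetical helper name, and the demo only runs if the linked SQLite library was built with FTS5. The commit matters because FTS5 may hold pending index data in memory until the transaction ends.

```python
import sqlite3


def list_data_records(conn, table):
    """Return (id, hex(block)) pairs from the <table>_data shadow table."""
    # Table names cannot be bound as parameters, so quote the identifier.
    quoted = '"' + (table + "_data").replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT id, hex(block) FROM {quoted} ORDER BY id")
    return [(record_id, hex_block) for record_id, hex_block in cursor]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE test_fts USING fts5(content)")
    except sqlite3.OperationalError as exc:
        print("FTS5 is not available in this build:", exc)
    else:
        conn.execute("INSERT INTO test_fts VALUES ('sample text')")
        conn.commit()  # flush pending FTS5 data to the shadow tables
        for record_id, hex_block in list_data_records(conn, "test_fts"):
            print(record_id, hex_block)
```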

Step 2: Decode the id=1 Averages Record

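Using SQLite's fts5_decode() function (a debugging aid that, per the FTS5 source, is compiled in only with -DSQLITE_TEST or -DSQLITE_FTS5_DEBUG; it takes the record id and the block):

SELECT fts5_decode(id, block) FROM test_fts_data WHERE id=1;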

The averages record is a flat list of packed varints with no header. For the single-row test table from Step 1, the raw block should be:

0x01 0x02 → {nRow=1, tokens=[2]}

Breakdown:

First varint: Total rows (1)
Subsequent varints: Total token count per column, left to right ('sample text' is 2 tokens in the first/only column)
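Where fts5_decode() is unavailable, the record can be decoded in application code. The sketch below assumes the layout given in the fts5_index.c comments (row count first, then one token total per column, all as SQLite varints). decode_averages and average_tokens_per_row are hypothetical names.

```python
from typing import List, Tuple


def _read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one SQLite-style varint; return (value, next position)."""
    value = 0
    for i in range(8):
        byte = buf[pos + i]
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:
            return value, pos + i + 1
    return (value << 8) | buf[pos + 8], pos + 9


def decode_averages(block: bytes) -> Tuple[int, List[int]]:
    """Split an averages record into (row count, per-column token totals)."""
    values = []
    pos = 0
    while pos < len(block):
        value, pos = _read_varint(block, pos)
        values.append(value)
    if not values:
        raise ValueError("empty averages record")
    return values[0], values[1:]


def average_tokens_per_row(block: bytes) -> List[float]:
    """Average number of tokens per row for each column."""
    n_row, totals = decode_averages(block)
    return [total / n_row if n_row else 0.0 for total in totals]


if __name__ == "__main__":
    # One row, two tokens in the only column.
    print(decode_averages(bytes.fromhex("0102")))  # (1, [2])
```

The per-column averages are the kind of figure a length-normalizing ranking function needs, which is why the totals are stored alongside the row count.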

Step 3: Analyze id=10 Structure Record

Decode the structure record:

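SELECT fts5_decode(id, block) FROM test_fts_data WHERE id=10;

Output describes the index layout:

Configuration cookie (4 bytes)
Number of levels and total number of segments
Write counter
Per-level merge count and segment list (segment id, first and last leaf page numbers)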
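For builds without fts5_decode(), the structure record can also be unpacked by hand. This sketch assumes the basic layout described in the fts5_index.c comments: a 4-byte big-endian cookie, then varints for the level count, segment count and write counter, then each level's merge count and segments. It does not handle any extended layout that newer releases may write, and decode_structure is a hypothetical name.

```python
import struct
from typing import Dict, Tuple


def _read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one SQLite-style varint; return (value, next position)."""
    value = 0
    for i in range(8):
        byte = buf[pos + i]
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:
            return value, pos + i + 1
    return (value << 8) | buf[pos + 8], pos + 9


def decode_structure(block: bytes) -> Dict:
    """Decode a structure record under the assumed basic layout."""
    if len(block) < 4:
        raise ValueError("structure record too short")
    cookie = struct.unpack(">I", block[:4])[0]
    pos = 4
    levels = []
    try:
        n_level, pos = _read_varint(block, pos)
        n_segment, pos = _read_varint(block, pos)
        n_write, pos = _read_varint(block, pos)
        for _ in range(n_level):
            n_merge, pos = _read_varint(block, pos)
            n_seg, pos = _read_varint(block, pos)
            segments = []
            for _ in range(n_seg):
                segid, pos = _read_varint(block, pos)
                first, pos = _read_varint(block, pos)
                last, pos = _read_varint(block, pos)
                segments.append({"segid": segid, "pgno_first": first, "pgno_last": last})
            levels.append({"n_merge": n_merge, "segments": segments})
    except IndexError:
        raise ValueError("truncated structure record") from None
    # Reject blobs that do not fit the assumed layout.
    if pos != len(block) or sum(len(lv["segments"]) for lv in levels) != n_segment:
        raise ValueError("block does not match the assumed structure layout")
    return {"cookie": cookie, "n_segment": n_segment, "n_write": n_write, "levels": levels}


if __name__ == "__main__":
    # Hypothetical record: one level holding one segment (id 3, pages 1-2).
    print(decode_structure(bytes.fromhex("000000000101020001030102")))
```

The final consistency check is deliberate: a blob that parses but leaves bytes over, or whose segment counts disagree, is more likely a different record type than a valid structure record.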

Step 4: Update Development Practices

Query id=1 for token statistics instead of id=10
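Avoid hardcoding IDs where a supported interface exists. FTS5 has no fts5_get() SQL function; per-column token totals can instead be read through an fts5vocab table (custom auxiliary functions written in C can use xRowCount and xColumnTotalSize):
```
CREATE VIRTUAL TABLE test_fts_vocab USING fts5vocab('test_fts', 'col');
SELECT col, sum(cnt) AS total_tokens FROM test_fts_vocab GROUP BY col;
```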
Monitor documentation updates – Subscribe to SQLite changelogs
Implement version checks in critical code:
```
SELECT sqlite_version() AS version;
```
Compare against known behavior changes in FTS5
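In application code, the version check can be expressed as a tuple comparison. This is a sketch using Python's sqlite3 module; sqlite_at_least is a hypothetical helper, and the threshold passed to it is whatever release your own testing identifies.

```python
import sqlite3


def sqlite_at_least(major, minor, patch=0, version_info=None):
    """True if the SQLite library version is at least major.minor.patch."""
    # version_info can be injected for testing; default is the linked library.
    info = sqlite3.sqlite_version_info if version_info is None else version_info
    return tuple(info) >= (major, minor, patch)


if __name__ == "__main__":
    print(sqlite3.sqlite_version, sqlite_at_least(3, 40))
```

Comparing tuples avoids the classic string-comparison trap in which "3.9.0" sorts after "3.40.0".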

Step 5: Mitigation Strategies for Existing Systems

For applications relying on the incorrect documentation:

Hotfix queries accessing id=10 to use id=1
Add validation layers that compare expected vs. actual ID usage
Implement fallback parsing that handles both ID locations
Audit custom C extensions using direct %_data access
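A validation layer with fallback parsing might look like the sketch below. It relies on a heuristic, not an official check: an averages record for an N-column table should consist of exactly N + 1 well-formed varints. The function names are hypothetical.

```python
def count_varints(block):
    """Count complete SQLite-style varints in block; reject a truncated tail."""
    count = 0
    run = 0  # bytes consumed so far in the current varint
    for byte in block:
        run += 1
        # A varint ends at a byte with the high bit clear, or at its ninth byte.
        if byte < 0x80 or run == 9:
            count += 1
            run = 0
    if run:
        raise ValueError("truncated varint")
    return count


def looks_like_averages(block, n_columns):
    """Heuristic: row count plus one token total per column."""
    try:
        return count_varints(block) == n_columns + 1
    except ValueError:
        return False


def pick_averages_id(records, n_columns, candidates=(1, 10)):
    """Return the first candidate id whose block has the expected shape."""
    for record_id in candidates:
        block = records.get(record_id)
        if block is not None and looks_like_averages(block, n_columns):
            return record_id
    return None


if __name__ == "__main__":
    records = {1: bytes.fromhex("0102"), 10: bytes.fromhex("000000000101010001010101")}
    print(pick_averages_id(records, n_columns=1))  # 1
```

Because the check looks only at the shape of the blob, it can be fooled by coincidence, so it is best treated as a guard that raises an alert rather than as proof of a record's type.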

Step 6: Advanced Binary Format Analysis

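Manually decode a hex blob into its fields:

Sample data: 000000000101010001010101
Decoding steps:
1. Split off the leading 4 bytes (00 00 00 00)
2. Read the remaining bytes as varints; every byte here is below 0x80, so each one is a complete single-byte varint:
   - 01 01 01 00 01 01 01 01 → 1, 1, 1, 0, 1, 1, 1, 1

Note: Varint lists carry no padding, so every value is meaningful. An averages record for a single-column table holds only two varints (the row count, then the column's token total) and has no fixed-width prefix. A 4-byte prefix followed by varints is the layout the fts5_index.c comments give for the structure record, so re-check this sample against a fresh dump from your own database before relying on its id=1 label.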

Step 7: Custom Token Counting Verification

Cross-validate token counts using:

-- For schema: fts5(a, b)
-- Approximation: assumes non-empty values whose tokens are separated by single spaces
SELECT
  COUNT(*) AS total_rows,
  SUM(LENGTH(a) - LENGTH(REPLACE(a, ' ', '')) + 1) AS a_tokens,
  SUM(LENGTH(b) - LENGTH(REPLACE(b, ' ', '')) + 1) AS b_tokens
FROM tbl;

Compare results with decoded varints from id=1
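The comparison can also run in the other direction: build the bytes the averages record is expected to contain and compare them with the stored block. This sketch assumes whitespace splitting approximates the tokenizer (reasonable only for simple ASCII text with the default tokenizer) and uses hypothetical function names.

```python
def encode_varint(value):
    """Encode a non-negative integer below 2**56 as a SQLite-style varint."""
    if not 0 <= value < 1 << 56:
        raise ValueError("value outside the range handled by this sketch")
    out = [value & 0x7F]  # final byte has the high bit clear
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)  # earlier bytes set the high bit
        value >>= 7
    return bytes(reversed(out))


def expected_averages_block(rows):
    """Build the expected record: row count, then per-column token totals."""
    n_columns = len(rows[0]) if rows else 0
    totals = [0] * n_columns
    for row in rows:
        for index, cell in enumerate(row):
            totals[index] += len(cell.split())  # whitespace tokens only
    block = encode_varint(len(rows))
    for total in totals:
        block += encode_varint(total)
    return block


if __name__ == "__main__":
    print(expected_averages_block([("sample text",)]).hex())  # 0102
```

A mismatch does not necessarily indicate corruption; punctuation, non-ASCII text or a custom tokenizer will all make the whitespace count diverge from what FTS5 recorded.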

Step 8: Handling Unindexed Columns

When using UNINDEXED columns:

CREATE VIRTUAL TABLE tbl USING fts5(a, b UNINDEXED);

The id=1 record will contain:

Total row count
Token count for indexed column ‘a’
Zero for unindexed column ‘b’

Step 9: Version-Specific Workarounds

For legacy systems requiring compatibility:

SQLite 3.40+: Use corrected id=1 location
Older versions: Implement dual-ID checks

Hybrid approach:

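-- A healthy FTS5 table has both records, so this returns the id=1 block
-- and falls back to id=10 only when id=1 is missing.
SELECT COALESCE(
  (SELECT block FROM tbl_data WHERE id=1),
  (SELECT block FROM tbl_data WHERE id=10)
) AS averages_data;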

Step 10: Long-Term Maintenance Strategy

Automated schema validation:

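import sqlite3

def verify_fts5_structure(conn: sqlite3.Connection, table: str) -> None:
    # Table names cannot be bound as parameters, so quote the identifier.
    cur = conn.execute(f'SELECT id FROM "{table}_data" WHERE id = 1')
    assert cur.fetchone() is not None, "Missing averages record"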

Integration testing with known token counts
Documentation monitoring through SQLite RSS feeds
Community engagement via SQLite forum participation

This comprehensive analysis equips developers to correctly interpret FTS5 storage structures, implement workarounds for documentation discrepancies, and build robust text search solutions on SQLite. The resolution emphasizes practical verification techniques combined with deep structural understanding of FTS5 internals.
