FTS5 Averages Record Mismatch: id=1 vs. id=10 Documentation Error
Understanding the FTS5 Averages Record Storage Discrepancy
The Full-Text Search version 5 (FTS5) extension in SQLite provides advanced text search capabilities through virtual tables. A critical component of its architecture is the internal %_data table, which stores metadata and structural information required for efficient query execution. The documentation describes a specific record within this table (id=10) as containing statistical averages and token counts across columns. However, empirical testing reveals this data is stored in the record with id=1, while id=10 contains structural metadata. This discrepancy between documented behavior and actual implementation creates confusion for developers relying on these details for low-level optimizations, debugging, or custom extensions.
Technical Breakdown of FTS5 Storage Architecture
FTS5 organizes data using multiple shadow tables, with the %_data table serving as a central repository for:
- The averages record: row count and per-column token totals (stored at id=1, though documented at id=10)
- The structure record describing the index's levels and segments (id=10)
- Index segments (stored at higher ID values)
- Special configuration records
Each record in %_data uses a binary format with packed varints, variable-length integers that optimize storage space. The misinterpretation of which ID stores specific data types directly impacts developers attempting to:
- Analyze index health
- Rebuild corrupted FTS5 tables manually
- Implement custom ranking algorithms
- Monitor token distribution across columns
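The packed-varint encoding mentioned above can be sketched in a few lines of Python. This is a minimal illustration, assuming FTS5 uses the same varint format as the SQLite core file format (up to nine bytes, seven payload bits in each of the first eight bytes, high bit set on every byte except the last). The function name read_varint is hypothetical and not part of any SQLite API.

```python
from typing import Tuple


def read_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one SQLite-style varint starting at buf[pos].

    Returns (value, next_pos). Assumes the buffer is not truncated;
    a truncated varint raises IndexError.
    """
    value = 0
    for i in range(8):
        byte = buf[pos + i]
        # Each of the first eight bytes carries 7 payload bits.
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:
            # A clear high bit marks the final byte of this varint.
            return value, pos + i + 1
    # If eight bytes all had the high bit set, a ninth byte
    # contributes all eight of its bits.
    return (value << 8) | buf[pos + 8], pos + 9


if __name__ == "__main__":
    print(read_varint(bytes([0x01])))        # single-byte value
    print(read_varint(bytes([0x81, 0x00])))  # two-byte value
```

Small counts such as row totals in a test table fit in one byte each, which is why short records read as runs of 01 and 02 in a hex dump.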
Root Causes of the Documentation-Implementation Mismatch
1. Version-Specific Documentation Drift
The FTS5 module has undergone significant revisions since its introduction. While the core functionality remains stable, internal storage details like record ID assignments may change between minor releases without immediate documentation updates. The confusion arises when:
- Documentation describes legacy behavior from earlier FTS versions
- Implementation changes are committed without corresponding doc updates
- Edge-case handling differs between SQLite core and extensions
2. Misinterpretation of Structure vs. Statistics
The %_data table contains multiple special-purpose records:
- id=1: Actual averages and token counts (documented as id=10)
- id=10: Structure record (a 4-byte cookie followed by varints listing the index's levels and segments)
- Higher IDs: Segment pages holding the inverted index
Configuration options are kept in the separate %_config shadow table rather than in %_data.
The original documentation conflated the purpose of id=10 (the structure record) with id=1 (statistical data), likely due to:
- Copy-paste errors from FTS4 documentation
- Mislabeled internal variable names in source code comments
- Overlapping storage formats between structure and statistics records
3. Varint Encoding Complexity
Both id=1 and id=10 records use packed varint encoding, making visual inspection of hex dumps challenging:
-- Example from original report:
id=1 | hex(block)=000000000101010001010101
id=10| hex(block)=0000000E063068656C6C6F01020204
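Short blobs like these are easy to mislabel without proper decoding tools. Read against the record layouts described in fts5_index.c, the first blob has the shape of a structure record (a 4-byte cookie followed by single-byte varints) and the second resembles a segment leaf page holding the term "hello", so the id labels in this excerpt should be treated with caution and rechecked against a live database.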
Resolving the FTS5 Storage Misunderstanding
Step 1: Validate FTS5 Table Structure
Create a minimal test case to verify storage behavior:
CREATE VIRTUAL TABLE test_fts USING fts5(content);
INSERT INTO test_fts VALUES ('sample text');
SELECT id, hex(block) FROM test_fts_data;
Expected output shows:
- id=1 with two varints for this one-column table (row count, then the column's token total), which should read 0102 in hex
- id=10 with the structure record
- Higher IDs containing index segments
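The same check can be scripted from Python's standard sqlite3 module. The sketch below is illustrative only: it assumes the interpreter's bundled SQLite was compiled with FTS5 (many are, but not all), and the helper names dump_shadow_rows and build_test_table are hypothetical.

```python
import sqlite3


def dump_shadow_rows(conn, table):
    """Return [(id, hex_of_block), ...] from <table>_data, ordered by id."""
    cur = conn.execute(
        f'SELECT id, hex(block) FROM "{table}_data" ORDER BY id'
    )
    return cur.fetchall()


def build_test_table(conn):
    """Create the one-row test table; return False if FTS5 is unavailable."""
    try:
        conn.execute("CREATE VIRTUAL TABLE test_fts USING fts5(content)")
    except sqlite3.OperationalError:
        return False
    conn.execute("INSERT INTO test_fts VALUES ('sample text')")
    # Commit before reading the shadow table so pending index data is flushed.
    conn.commit()
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    if build_test_table(conn):
        for rowid, blob_hex in dump_shadow_rows(conn, "test_fts"):
            print(rowid, blob_hex)
    else:
        print("this SQLite build does not include FTS5")
```

Ordering by id puts the two small special-purpose rows first, with the much larger segment ids after them.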
Step 2: Decode the id=1 Averages Record
Using SQLite’s fts5_decode() function (available only in builds compiled with -DSQLITE_FTS5_DEBUG or -DSQLITE_TEST; it takes the row id as well as the blob):
SELECT fts5_decode(id, block) FROM test_fts_data WHERE id=1;
The record is a bare sequence of packed varints with no header:
0x01 0x02 → {nRow=1, tokens=[2]}
Breakdown:
- First varint: Total rows (1)
- Subsequent varints: Token totals per column (2 tokens in the first and only column, 'sample' and 'text')
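Where fts5_decode() is unavailable, the averages blob can be decoded by hand. The following is a minimal sketch, assuming the record is nothing more than a list of varints (row count first, then one token total per column) as the FTS5 source comments describe. The names decode_averages and _read_varint are hypothetical helpers, not SQLite functions.

```python
from typing import List, Tuple


def _read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one SQLite-style varint at buf[pos]; return (value, next_pos)."""
    value = 0
    for i in range(8):
        byte = buf[pos + i]
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:
            return value, pos + i + 1
    # A ninth byte contributes all eight of its bits.
    return (value << 8) | buf[pos + 8], pos + 9


def decode_averages(block: bytes) -> Tuple[int, List[int]]:
    """Split an averages blob into (row_count, [token_total_per_column])."""
    values = []
    pos = 0
    while pos < len(block):
        value, pos = _read_varint(block, pos)
        values.append(value)
    if not values:
        # An empty blob carries no counts at all.
        return 0, []
    return values[0], values[1:]


if __name__ == "__main__":
    print(decode_averages(bytes.fromhex("0102")))
```

Dividing each column total by the row count gives the average column length that ranking functions such as bm25 rely on.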
Step 3: Analyze id=10 Structure Record
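Decode the structure record:
SELECT fts5_decode(id, block) FROM test_fts_data WHERE id=10;
Output contains:
- The 4-byte cookie value
- Number of levels, total number of segments, and the write counter
- For each level, the number of segments being merged and the number of segments present
- For each segment, its segment id and its first and last page numbers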
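A hand-written decoder for the structure record might look like the sketch below. It assumes the original structure layout described in the fts5_index.c comments (a 4-byte big-endian cookie, then varints for level count, segment count, and write counter, then per-level and per-segment varints) and does not handle the extended layout used by newer contentless-delete tables. The function and key names are hypothetical.

```python
from typing import Tuple


def _read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one SQLite-style varint at buf[pos]; return (value, next_pos)."""
    value = 0
    for i in range(8):
        byte = buf[pos + i]
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:
            return value, pos + i + 1
    return (value << 8) | buf[pos + 8], pos + 9


def decode_structure(block: bytes) -> dict:
    """Parse a structure blob (assumed original layout) into a dict."""
    cookie = int.from_bytes(block[:4], "big")
    pos = 4
    n_level, pos = _read_varint(block, pos)
    n_segment, pos = _read_varint(block, pos)
    write_counter, pos = _read_varint(block, pos)
    levels = []
    for _ in range(n_level):
        n_merge, pos = _read_varint(block, pos)
        n_seg, pos = _read_varint(block, pos)
        segments = []
        for _ in range(n_seg):
            segid, pos = _read_varint(block, pos)
            first, pos = _read_varint(block, pos)
            last, pos = _read_varint(block, pos)
            segments.append((segid, first, last))
        levels.append({"n_merge": n_merge, "segments": segments})
    return {
        "cookie": cookie,
        "n_level": n_level,
        "n_segment": n_segment,
        "write_counter": write_counter,
        "levels": levels,
    }


if __name__ == "__main__":
    print(decode_structure(bytes.fromhex("000000000101010001010101")))
```

Because the fixed-width cookie comes first, a structure blob is always at least seven bytes long, which is one quick way to tell it apart from a short averages blob.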
Step 4: Update Development Practices
- Query id=1 for token statistics instead of id=10
- Avoid scattering literal IDs through code – define named constants that mirror FTS5_AVERAGES_ROWID (1) and FTS5_STRUCTURE_ROWID (10) from fts5_index.c, since SQLite exposes no SQL helper for fetching these records by name
- Monitor documentation updates – Subscribe to SQLite changelogs
- Implement version checks in critical code:
SELECT sqlite_version() AS version;
Compare against known behavior changes in FTS5
Step 5: Mitigation Strategies for Existing Systems
For applications relying on the incorrect documentation:
- Hotfix queries accessing id=10 to use id=1
- Add validation layers that compare expected vs. actual ID usage
- Implement fallback parsing that handles both ID locations
- Audit custom C extensions using direct %_data access
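The validation and fallback ideas above can be combined in one small routine. This is a sketch under stated assumptions: an averages record is taken to be exactly one varint for the row count plus one per column, and the check is a heuristic rather than a guarantee. The names plausible_averages and pick_averages are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

AVERAGES_ROWID = 1     # where the averages data is actually observed
DOCUMENTED_ROWID = 10  # where the documentation placed it


def _varints(block: bytes) -> Optional[List[int]]:
    """Decode a whole blob as varints; return None if it is truncated."""
    values, pos = [], 0
    try:
        while pos < len(block):
            value = 0
            for i in range(8):
                byte = block[pos + i]
                value = (value << 7) | (byte & 0x7F)
                if byte < 0x80:
                    pos += i + 1
                    break
            else:
                # No terminating byte in the first eight: use a ninth.
                value = (value << 8) | block[pos + 8]
                pos += 9
            values.append(value)
    except IndexError:
        return None
    return values


def plausible_averages(block: bytes, n_columns: int) -> Optional[List[int]]:
    """Return the decoded values if the blob has n_columns + 1 varints."""
    values = _varints(block)
    if values is None or len(values) != n_columns + 1:
        return None
    return values


def pick_averages(
    blobs: Dict[int, bytes], n_columns: int
) -> Optional[Tuple[int, List[int]]]:
    """Try id=1 first, then id=10; return (id, values) or None."""
    for rowid in (AVERAGES_ROWID, DOCUMENTED_ROWID):
        block = blobs.get(rowid)
        if block is None:
            continue
        values = plausible_averages(block, n_columns)
        if values is not None:
            return rowid, values
    return None
```

The varint-count test can be fooled when a structure blob happens to contain the same number of bytes as the table has columns plus one, so it is best paired with a sanity check on the decoded row count.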
Step 6: Advanced Binary Format Analysis
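Manually decode the sample blob quoted earlier:
Sample data: 000000000101010001010101
Decoding steps:
1. Read the first 4 bytes as a fixed-width field (00000000)
2. Read the remaining bytes as varints, each one byte long here:
- 0x01 0x01 0x01 → 1 level, 1 segment, write counter 1
- 0x00 0x01 → level 0 has 0 segments being merged and 1 segment present
- 0x01 0x01 0x01 → segment id 1, first page 1, last page 1
Note: Packed varints carry no padding, so every byte must be accounted for. A blob that parses this way matches the structure-record layout (id=10); an averages record for the one-row test table above is just the two bytes 01 02.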
Step 7: Custom Token Counting Verification
Cross-validate token counts using:
-- For schema: fts5(a, b)
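SELECT
  (SELECT COUNT(*) FROM tbl) AS total_rows,
  SUM(LENGTH(a) - LENGTH(REPLACE(a, ' ', '')) + 1) AS a_tokens,
  SUM(LENGTH(b) - LENGTH(REPLACE(b, ' ', '')) + 1) AS b_tokens
FROM tbl;
Compare results with decoded varints from id=1. The space-counting expression only approximates the tokenizer: it assumes words separated by single spaces and counts an empty string as one token.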
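The same cross-check can be done outside SQL. The sketch below computes the values one would expect to find in the averages record, using whitespace splitting as a stand-in for the tokenizer. That is an assumption: the default unicode61 tokenizer also splits on punctuation, so the numbers agree only for plain space-separated words. The function name expected_averages and the unindexed parameter are hypothetical.

```python
from typing import Iterable, List, Sequence


def expected_averages(
    rows: Iterable[Sequence[str]], unindexed: Sequence[int] = ()
) -> List[int]:
    """Return [row_count, token_total_col0, token_total_col1, ...].

    Columns whose positions appear in `unindexed` are left at zero.
    """
    n_rows = 0
    totals: List[int] = []
    for row in rows:
        n_rows += 1
        if not totals:
            totals = [0] * len(row)
        for col, text in enumerate(row):
            if col not in unindexed:
                # str.split() with no argument collapses runs of whitespace.
                totals[col] += len(text.split())
    return [n_rows] + totals


if __name__ == "__main__":
    print(expected_averages([("sample text", "a b c"), ("hello", "")]))
```

Unlike the SQL expression, str.split() treats an empty string as zero tokens and tolerates repeated spaces, which makes it a slightly closer match for simple text.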
Step 8: Handling Unindexed Columns
When using UNINDEXED columns:
CREATE VIRTUAL TABLE tbl USING fts5(a, b UNINDEXED);
The id=1 record will contain:
- Total row count
- Token count for indexed column ‘a’
- Zero for unindexed column ‘b’
Step 9: Version-Specific Workarounds
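For legacy systems requiring compatibility:
- The averages record ID appears in the FTS5 source as a compile-time constant (FTS5_AVERAGES_ROWID, defined as 1), so id=1 should be the right location on any release; the mismatch lies in the documentation rather than in the file format
- Dual-ID checks: Implement these only as a defensive measure, and validate whatever blob comes back, since id=10 holds the structure record
- Hybrid approach:
SELECT COALESCE(
  (SELECT block FROM tbl_data WHERE id=1),
  (SELECT block FROM tbl_data WHERE id=10)
) AS averages_data;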
Step 10: Long-Term Maintenance Strategy
- Automated schema validation:
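def verify_fts5_structure(conn, table):
    # conn is an open sqlite3.Connection; execute() returns the cursor to fetch from
    row = conn.execute(f'SELECT id FROM "{table}_data" WHERE id=1').fetchone()
    assert row is not None, "Missing averages record"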
- Integration testing with known token counts
- Documentation monitoring through SQLite RSS feeds
- Community engagement via SQLite forum participation
This comprehensive analysis equips developers to correctly interpret FTS5 storage structures, implement workarounds for documentation discrepancies, and build robust text search solutions on SQLite. The resolution emphasizes practical verification techniques combined with deep structural understanding of FTS5 internals.