Calculating Word Frequency by Year Using SQLite FTS5 and Vocabulary Tables
Understanding Term Frequency Distribution Across Years with FTS5Vocab
Architectural Context: Treatment Records and Full-Text Search
The core challenge involves analyzing the temporal distribution of specific words or phrases within a corpus of treatment records stored in SQLite. The schema comprises two critical components:
Physical Table (treatments):
- Stores metadata including unique treatment identifiers, titles, journal years, and full-text content
- Critical fields:
  - id (INTEGER PRIMARY KEY): Internal row identifier
  - treatmentId (TEXT): 32-character unique external identifier
  - journalYear (INTEGER): Publication year for temporal analysis
  - fulltext (TEXT): Unstructured content for linguistic analysis
Virtual Table (treatmentsFts):
- FTS5-powered search index covering both titles and fulltext
- Case-insensitive matching (terms are case-folded at tokenization time)
- Employs contentless design (content='') to avoid data duplication
The operational requirement demands correlating term occurrence patterns with publication years while maintaining direct access to source documents through treatment identifiers. This necessitates bridging SQLite’s full-text search capabilities with traditional relational query patterns.
Challenges in Temporal Frequency Analysis
Three fundamental obstacles emerge when attempting year-based frequency calculations:
FTS5 Statistics Granularity:
- Built-in auxiliary functions like bm25() focus on document relevance scoring
- Inverse Document Frequency (IDF) metrics operate at corpus level without temporal context
- No native support for grouping search statistics by external metadata columns
Vocabulary Table Limitations:
- Row-type FTS5Vocab tables provide global term frequencies
- Column-type variants break down frequencies per column but lack document context
- Neither directly connects to publication year metadata
Join Operation Complexity:
- Relating virtual table statistics to physical records requires precise key mapping
- Document identifier reconciliation between FTS5 tables and source data must handle:
  - Contentless FTS table configurations
  - Potential identifier type mismatches (INTEGER vs TEXT)
  - Index coverage for efficient large-scale joins
Comprehensive Solution Framework
Phase 1: Vocabulary Table Configuration
Implement instance-type FTS5Vocab table for granular term tracking:
CREATE VIRTUAL TABLE treatmentsFtvIns
USING fts5vocab('treatmentsFts', 'instance');
Key characteristics:
- Stores individual term occurrences with document IDs
- Records column location (treatmentTitle vs fulltext)
- Captures positional offsets within documents
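To see what the instance table exposes, here is a minimal sketch using Python's built-in sqlite3 module, assuming an SQLite build with FTS5 enabled; the sample documents and identifiers are purely illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE treatments (
        id INTEGER PRIMARY KEY,
        treatmentId TEXT UNIQUE,
        journalYear INTEGER,
        fulltext TEXT
    );
    -- Contentless FTS index: rowids must be supplied explicitly on insert
    CREATE VIRTUAL TABLE treatmentsFts
        USING fts5(treatmentTitle, fulltext, content='');
    -- Instance-type vocab table: one row per individual term occurrence
    CREATE VIRTUAL TABLE treatmentsFtvIns
        USING fts5vocab('treatmentsFts', 'instance');
""")

docs = [
    (1, "t0001", 1998, "Formica rufa observed near nest mounds."),
    (2, "t0002", 2004, "Notes on Formica and Camponotus foraging."),
]
for rowid, tid, year, text in docs:
    con.execute("INSERT INTO treatments VALUES (?, ?, ?, ?)",
                (rowid, tid, year, text))
    con.execute("INSERT INTO treatmentsFts(rowid, treatmentTitle, fulltext) "
                "VALUES (?, ?, ?)", (rowid, f"Treatment {tid}", text))

# Each row: term, document rowid, column name, token offset
rows = con.execute("SELECT term, doc, col, offset FROM treatmentsFtvIns "
                   "WHERE term = 'formica'").fetchall()
```

The default unicode61 tokenizer case-folds terms, so "Formica" in the source text is stored (and must be queried) as "formica".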
Phase 2: Temporal Aggregation Query
Join vocabulary data with source metadata for year-based counts:
SELECT t.journalYear, COUNT(ftv.doc) AS term_count
FROM treatmentsFtvIns ftv
JOIN treatments t ON ftv.doc = t.id
WHERE ftv.term = 'formica' COLLATE NOCASE
AND ftv.col = 'fulltext'
GROUP BY t.journalYear
ORDER BY t.journalYear ASC;
Critical optimizations:
- The COLLATE NOCASE clause guards against case mismatches (the default unicode61 tokenizer already stores terms case-folded)
- The column filter restricts counts to fulltext occurrences only
- COUNT(ftv.doc) counts one row per occurrence; use COUNT(DISTINCT ftv.doc) to count matching documents instead
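Under the same assumed schema, the aggregation can be exercised end-to-end with Python's sqlite3 module (FTS5 build assumed; sample data illustrative). Note that counting instance rows yields occurrence totals, not document totals:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE treatments (id INTEGER PRIMARY KEY, journalYear INTEGER,
                             fulltext TEXT);
    CREATE VIRTUAL TABLE treatmentsFts USING fts5(fulltext, content='');
    CREATE VIRTUAL TABLE treatmentsFtvIns
        USING fts5vocab('treatmentsFts', 'instance');
""")
docs = [
    (1, 1998, "Formica rufa near the nest"),
    (2, 1998, "Camponotus and Formica workers; Formica again"),
    (3, 2004, "Formica foraging trails"),
]
for rid, year, text in docs:
    con.execute("INSERT INTO treatments VALUES (?, ?, ?)", (rid, year, text))
    con.execute("INSERT INTO treatmentsFts(rowid, fulltext) VALUES (?, ?)",
                (rid, text))

# COUNT(ftv.doc) counts occurrences: doc 2 contributes two 1998 hits
counts = con.execute("""
    SELECT t.journalYear, COUNT(ftv.doc) AS term_count
    FROM treatmentsFtvIns ftv
    JOIN treatments t ON ftv.doc = t.id
    WHERE ftv.term = 'formica'
    GROUP BY t.journalYear
    ORDER BY t.journalYear
""").fetchall()
```

Swapping in COUNT(DISTINCT ftv.doc) would instead report how many documents per year mention the term at least once.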
Phase 3: Performance Enhancements
Indexing Strategy:
CREATE INDEX idx_treatments_year ON treatments(journalYear);
- treatments.id is an INTEGER PRIMARY KEY, i.e. an alias for the rowid, so JOINs on it are already index-backed and need no additional index
- The journalYear index optimizes GROUP BY operations on publication year
Materialized Statistics:
CREATE TABLE term_year_stats (
    term TEXT COLLATE NOCASE,
    journalYear INTEGER,
    occurrence_count INTEGER,
    last_updated DATETIME,
    PRIMARY KEY (term, journalYear)
);
- Precompute and refresh using triggers
- Enable near-instant historical queries
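A refresh routine for such a statistics table can lean on SQLite's upsert syntax; in this sketch the function name and sample counts are illustrative, and stale counts are simply overwritten in place:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE term_year_stats (
        term TEXT COLLATE NOCASE,
        journalYear INTEGER,
        occurrence_count INTEGER,
        last_updated DATETIME,
        PRIMARY KEY (term, journalYear)
    )
""")

def refresh(term, year, count):
    # Upsert: insert a fresh row, or overwrite the stale count on conflict
    con.execute("""
        INSERT INTO term_year_stats VALUES (?, ?, ?, datetime('now'))
        ON CONFLICT(term, journalYear) DO UPDATE SET
            occurrence_count = excluded.occurrence_count,
            last_updated = excluded.last_updated
    """, (term, year, count))

refresh("formica", 1998, 3)
refresh("formica", 1998, 5)   # re-refresh replaces the old value
row = con.execute("SELECT occurrence_count FROM term_year_stats").fetchone()
```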
Query Parallelization:
- PRAGMA threads allows SQLite worker threads to parallelize large sort operations
- PRAGMA temp_store = MEMORY keeps intermediate join and sort results out of disk I/O
Advanced Implementation Considerations
Phrase Handling:
Adjust the tokenizer configuration when creating the FTS table:
CREATE VIRTUAL TABLE treatmentsFts USING fts5(
    treatmentTitle,
    fulltext,
    content='',
    tokenize = "porter unicode61 separators '-'"
);
- Declaring '-' as a separator splits hyphenated compounds into individually searchable tokens
- The Porter stemmer enables linguistic normalization (note that the vocabulary tables will then contain stemmed terms)
Frequency Weighting:
Incorporate document length normalization:
SELECT t.journalYear,
       COUNT(ftv.doc) AS raw_count,
       COUNT(ftv.doc) * 1.0 / AVG(LENGTH(t.fulltext)) AS normalized_count
FROM treatmentsFtvIns ftv
JOIN treatments t ON ftv.doc = t.id
WHERE ftv.term = 'formica'
GROUP BY t.journalYear;
- Mitigates bias towards longer documents
- Enables fair cross-year comparisons
Temporal Smoothing:
Apply window functions for moving averages:
WITH yearly_counts AS (
    SELECT t.journalYear, COUNT(ftv.doc) AS cnt
    FROM treatmentsFtvIns ftv
    JOIN treatments t ON ftv.doc = t.id
    WHERE ftv.term = 'formica'
    GROUP BY t.journalYear
)
SELECT journalYear,
       AVG(cnt) OVER (
           ORDER BY journalYear
           ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
       ) AS smoothed_count
FROM yearly_counts;
- Reduces noise in sparse year data
- Provides trend visualization stability
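The moving-average window runs on any SQLite 3.25+ build; here is a small self-contained check with made-up yearly counts:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE yearly_counts (journalYear INTEGER, cnt INTEGER)")
con.executemany("INSERT INTO yearly_counts VALUES (?, ?)",
                [(2000, 1), (2001, 9), (2002, 2), (2003, 8), (2004, 5)])

# 5-year centered moving average; the window shrinks at the edges
smoothed = con.execute("""
    SELECT journalYear,
           AVG(cnt) OVER (
               ORDER BY journalYear
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
           ) AS smoothed_count
    FROM yearly_counts
    ORDER BY journalYear
""").fetchall()
```

The noisy series (1, 9, 2, 8, 5) flattens into values clustered around the overall mean, which is the smoothing effect the trend plots rely on.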
Diagnostic Techniques for Common Issues
Term Mismatch Verification:
SELECT DISTINCT term
FROM treatmentsFtvIns
WHERE term LIKE '%formica%' COLLATE NOCASE;
- Identifies spelling variations and stemmed forms
Document Alignment Check:
SELECT COUNT(*) FROM (
    SELECT doc FROM treatmentsFtvIns
    EXCEPT
    SELECT id FROM treatments
);
- Detects orphaned FTS entries
- Ensures contentless FTS table consistency
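The EXCEPT-based check is easy to exercise on stand-in tables; the table names below are placeholders, not the real schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE vocab_docs (doc INTEGER);        -- stands in for treatmentsFtvIns
    CREATE TABLE treatments (id INTEGER PRIMARY KEY);
    INSERT INTO vocab_docs VALUES (1), (2), (3);
    INSERT INTO treatments (id) VALUES (1), (2);
""")

# Doc 3 exists only on the vocabulary side: one orphaned FTS entry
orphans, = con.execute("""
    SELECT COUNT(*) FROM (
        SELECT doc FROM vocab_docs
        EXCEPT
        SELECT id FROM treatments
    )
""").fetchone()
```

A nonzero result signals that the contentless FTS index and the base table have drifted apart, typically from deleting base rows without issuing the matching FTS delete.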
Performance Profiling:
EXPLAIN QUERY PLAN
SELECT t.journalYear, COUNT(ftv.doc)
FROM treatmentsFtvIns ftv
JOIN treatments t ON ftv.doc = t.id
WHERE ftv.term = 'formica'
GROUP BY t.journalYear;
- Analyze query execution steps
- Verify index utilization
Alternative Approaches and Tradeoffs
Trigger-Based Counting:
CREATE TRIGGER trg_term_insert AFTER INSERT ON treatments
BEGIN
    INSERT INTO term_year_stats (term, journalYear, occurrence_count)
    SELECT value,
           NEW.journalYear,
           (LENGTH(NEW.fulltext) - LENGTH(REPLACE(LOWER(NEW.fulltext), value, '')))
               / LENGTH(value)
    FROM json_each(
        json_array('formica', 'camponotus', 'atta')  -- predefined term list
    )
    WHERE true  -- required to disambiguate the upsert after INSERT ... SELECT
    ON CONFLICT(term, journalYear) DO UPDATE SET
        occurrence_count = occurrence_count + excluded.occurrence_count;
END;
- Pros: Real-time updates, fixed cost per insert
- Cons: Requires predefined term list, inflexible to new queries
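The REPLACE-based counting idiom is easy to get wrong: the subtraction must be parenthesized before dividing by the term length, and it counts raw substring hits (including matches inside longer words). A quick check via Python's sqlite3, with an illustrative sample string:

```python
import sqlite3

con = sqlite3.connect(":memory:")
text, term = "Formica near formica mounds", "formica"

# Subtract first, THEN divide by the term length; without the outer
# parentheses the division binds tighter and the count comes out wrong
count, = con.execute("""
    SELECT (LENGTH(:text) - LENGTH(REPLACE(LOWER(:text), :term, '')))
           / LENGTH(:term)
""", {"text": text, "term": term}).fetchone()
```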
Hybrid Materialized Views:
CREATE VIEW aggregated_term_years AS
SELECT ftv.term,
       t.journalYear,
       COUNT(ftv.doc) AS cnt,
       GROUP_CONCAT(t.treatmentId, '|') AS treatment_ids
FROM treatmentsFtvIns ftv
JOIN treatments t ON ftv.doc = t.id
GROUP BY ftv.term, t.journalYear;
- Pros: Combines counts with document identifiers
- Cons: A plain view recomputes the aggregation on every query; a true materialized copy (CREATE TABLE ... AS SELECT) avoids that cost but requires periodic refresh and adds storage overhead
Probabilistic Data Structures:
#include <sqlite3ext.h>
SQLITE_EXTENSION_INIT1

static void hyperloglog_step(
    sqlite3_context *context, int argc, sqlite3_value **argv
){
    /* HyperLogLog accumulator update */
}

static void hyperloglog_final(sqlite3_context *context){
    /* Emit the approximate distinct count */
}

/* Register the aggregate; aggregates need both xStep and xFinal callbacks */
int sqlite3_hll_init(
    sqlite3 *db, char **pzErrMsg, const sqlite3_api_routines *pApi
){
    SQLITE_EXTENSION_INIT2(pApi);
    return sqlite3_create_function(db, "hll_count", 1, SQLITE_UTF8, 0,
                                   0, hyperloglog_step, hyperloglog_final);
}
- Pros: Approximate counts with minimal memory
- Cons: Requires custom extensions, inexact results
Cross-Platform Optimization Patterns
Columnar Storage Techniques:
CREATE TABLE year_term_matrix (
    journalYear INTEGER PRIMARY KEY,
    formica INTEGER DEFAULT 0,
    camponotus INTEGER DEFAULT 0,
    atta INTEGER DEFAULT 0
);
- Enables vectorized operations on term counts
- Efficient for fixed taxonomy analyses
Sparse Temporal Indexing:
CREATE VIRTUAL TABLE year_spans USING rtree(
    id,                    -- integer primary key, required by rtree
    year_start, year_end,  -- one coordinate pair
    +doc_ids               -- auxiliary column (SQLite 3.24+): compressed document ID list
);
- Accelerates range queries over years
- Compact storage for temporal document sets
In-Memory Acceleration:
ATTACH DATABASE ':memory:' AS mem;
CREATE TABLE mem.term_cache AS
SELECT term, doc, col, offset FROM treatmentsFtvIns;
- Bypass virtual table I/O overhead
- Volatile but fast for session-based analysis
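The ATTACH pattern can be sketched with Python's sqlite3 module; term_cache_src below is a stand-in for the vocabulary table, since the point is the cross-database copy rather than FTS itself:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stands in for the file-backed main database
con.execute("CREATE TABLE term_cache_src (term TEXT, doc INTEGER)")
con.executemany("INSERT INTO term_cache_src VALUES (?, ?)",
                [("formica", 1), ("formica", 2)])

# Attach a separate in-memory database and copy the hot rows into it
con.execute("ATTACH DATABASE ':memory:' AS mem")
con.execute("CREATE TABLE mem.term_cache AS SELECT term, doc FROM term_cache_src")
cached, = con.execute("SELECT COUNT(*) FROM mem.term_cache").fetchone()
```

Subsequent analysis queries then hit mem.term_cache as a plain table, bypassing the virtual-table machinery; the cache disappears when the connection closes.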
Long-Term Maintenance Considerations
Vocabulary Table Replication:
CREATE TABLE ftvins_snapshot (
    term TEXT,
    doc INTEGER,
    col TEXT,
    offset INTEGER,
    snapshot_date DATETIME
);
INSERT INTO ftvins_snapshot
SELECT *, datetime('now') FROM treatmentsFtvIns;
- Enables historical trend comparisons
- Requires periodic archiving strategy
Query Plan Analysis:
The sqlite_stmt virtual table (available in builds with SQLITE_ENABLE_STMTVTAB) exposes per-statement statistics:
SELECT sql, nscan, nsort
FROM sqlite_stmt
WHERE sql LIKE '%treatmentsFtvIns%';
- Monitor index usage efficiency
- Detect plan regressions over schema changes
Compression Strategies:
PRAGMA auto_vacuum = INCREMENTAL;  -- takes effect on an existing database only after a full VACUUM
PRAGMA incremental_vacuum;
- Manage database growth from FTS auxiliary tables
- Balance storage footprint with query performance
Final Implementation Checklist
Schema Validation:
- Verify FTS5 table uses contentless mode properly
- Confirm treatmentId UNIQUE constraint functionality
- Check collation sequence consistency across joins
Index Coverage Audit:
- Ensure existence of treatments.id index
- Validate journalYear index selectivity
- Consider composite indexes on frequent query patterns
Performance Benchmarking:
- Establish baseline query times for sample terms
- Measure memory usage during large aggregations
- Profile disk I/O patterns during joins
Data Quality Assurance:
- Reconcile FTS document counts with base table
- Validate year ranges against known publication dates
- Spot-check term offsets in sample documents
This comprehensive approach enables robust temporal linguistic analysis within SQLite’s constraints, balancing query flexibility with performance considerations. The solution scales through strategic indexing, materialized aggregates, and careful schema design while maintaining direct access to source documents for drill-down capabilities.