Calculating Word Frequency by Year Using SQLite FTS5 and Vocabulary Tables

Understanding Term Frequency Distribution Across Years with FTS5Vocab

Architectural Context: Treatment Records and Full-Text Search

The core challenge involves analyzing the temporal distribution of specific words or phrases within a corpus of treatment records stored in SQLite. The schema comprises two critical components:

  1. Physical Table (treatments):

    • Stores metadata including unique treatment identifiers, titles, journal years, and full-text content
    • Critical fields:
      • id (INTEGER PRIMARY KEY): Internal row identifier
      • treatmentId (TEXT): 32-character unique external identifier
      • journalYear (INTEGER): Publication year for temporal analysis
      • fulltext (TEXT): Unstructured content for linguistic analysis
  2. Virtual Table (treatmentsFts):

    • FTS5-powered search index covering both titles and fulltext
    • Case-insensitive matching via the tokenizer's Unicode case folding (FTS5 match behavior is governed by the tokenizer, not SQL collations)
    • Employs contentless design (content='') to avoid data duplication

The operational requirement demands correlating term occurrence patterns with publication years while maintaining direct access to source documents through treatment identifiers. This necessitates bridging SQLite’s full-text search capabilities with traditional relational query patterns.
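The two-table layout described above can be sketched end to end with Python's bundled sqlite3 module. Table and column names follow the schema discussed here; the sample rows are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE treatments (
        id             INTEGER PRIMARY KEY,
        treatmentId    TEXT UNIQUE,   /* 32-character external identifier */
        journalYear    INTEGER,
        treatmentTitle TEXT,
        fulltext       TEXT
    );
    /* Contentless FTS5 index over title and fulltext */
    CREATE VIRTUAL TABLE treatmentsFts USING fts5(
        treatmentTitle, fulltext, content=''
    );
""")
rows = [
    (1, "A" * 32, 1998, "Formica notes",  "formica nests in rotten wood"),
    (2, "B" * 32, 1998, "Atta survey",    "atta colonies, with some formica"),
    (3, "C" * 32, 2001, "Camponotus key", "camponotus workers only"),
]
con.executemany("INSERT INTO treatments VALUES (?, ?, ?, ?, ?)", rows)
# Contentless tables store no column values, so the rowid must be set
# explicitly to mirror treatments.id for later joins.
con.executemany(
    "INSERT INTO treatmentsFts(rowid, treatmentTitle, fulltext) VALUES (?, ?, ?)",
    [(r[0], r[3], r[4]) for r in rows],
)
hits = sorted(r[0] for r in con.execute(
    "SELECT rowid FROM treatmentsFts WHERE treatmentsFts MATCH 'formica'"))
print(hits)  # [1, 2]
```

The explicit rowid on insert is the linchpin: it is the only link a contentless FTS table retains back to the source row.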

Challenges in Temporal Frequency Analysis

Three fundamental obstacles emerge when attempting year-based frequency calculations:

  1. FTS5 Statistics Granularity:

    • Built-in auxiliary functions like bm25() focus on document relevance scoring
    • Inverse Document Frequency (IDF) metrics operate at corpus level without temporal context
    • No native support for grouping search statistics by external metadata columns
  2. Vocabulary Table Limitations:

    • Row-type FTS5Vocab tables provide global term frequencies
    • Column-type variants break down frequencies per column but lack document context
    • Neither directly connects to publication year metadata
  3. Join Operation Complexity:

    • Relating virtual table statistics to physical records requires precise key mapping
    • Document identifier reconciliation between FTS5 tables and source data must handle:
      • Contentless FTS table configurations
      • Potential identifier type mismatches (INTEGER vs TEXT)
      • Index coverage for efficient large-scale joins

Comprehensive Solution Framework

Phase 1: Vocabulary Table Configuration

Implement instance-type FTS5Vocab table for granular term tracking:

CREATE VIRTUAL TABLE treatmentsFtvIns 
USING fts5vocab('treatmentsFts', 'instance');

Key characteristics:

  • Stores individual term occurrences with document IDs
  • Records column location (treatmentTitle vs fulltext)
  • Captures positional offsets within documents
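A quick way to see all three characteristics is to index a single document and dump the instance table. The document text below is invented; note that the stored term is already case-folded by the default unicode61 tokenizer:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE VIRTUAL TABLE treatmentsFts USING fts5(
        treatmentTitle, fulltext, content=''
    );
    /* One row per individual term occurrence */
    CREATE VIRTUAL TABLE treatmentsFtvIns
    USING fts5vocab('treatmentsFts', 'instance');
""")
con.execute(
    "INSERT INTO treatmentsFts(rowid, treatmentTitle, fulltext) "
    "VALUES (1, 'Formica notes', 'formica near formica mounds')"
)
rows = list(con.execute(
    "SELECT term, doc, col, offset FROM treatmentsFtvIns "
    "WHERE term = 'formica' ORDER BY col, offset"
))
for term, doc, col, offset in rows:
    print(term, doc, col, offset)
```

Three rows come back: one occurrence in treatmentTitle (offset 0) and two in fulltext (offsets 0 and 2), each tagged with the document rowid.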

Phase 2: Temporal Aggregation Query

Join vocabulary data with source metadata for year-based counts:

SELECT t.journalYear, COUNT(*) AS occurrence_count
FROM treatmentsFtvIns ftv
JOIN treatments t ON ftv.doc = t.id
WHERE ftv.term = 'formica'
  AND ftv.col = 'fulltext'
GROUP BY t.journalYear
ORDER BY t.journalYear ASC;

Critical details:

  • Vocabulary terms are stored as emitted by the tokenizer (case-folded by unicode61), so match against the lowercase form; a COLLATE NOCASE comparison would only cover ASCII anyway
  • The column filter restricts counts to fulltext occurrences only
  • COUNT(*) counts term occurrences, since each instance-table row is one occurrence; use COUNT(DISTINCT ftv.doc) to count matching documents instead
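Putting the pieces together on sample data makes the counting semantics concrete: one instance-table row per occurrence, so a document containing the term twice contributes two to its year. The data below is invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE treatments (
        id INTEGER PRIMARY KEY, treatmentId TEXT UNIQUE,
        journalYear INTEGER, treatmentTitle TEXT, fulltext TEXT
    );
    CREATE VIRTUAL TABLE treatmentsFts USING fts5(
        treatmentTitle, fulltext, content=''
    );
    CREATE VIRTUAL TABLE treatmentsFtvIns
    USING fts5vocab('treatmentsFts', 'instance');
""")
rows = [
    (1, "A" * 32, 1998, "t1", "formica formica mounds"),  # 2 occurrences
    (2, "B" * 32, 1998, "t2", "one formica here"),        # 1 occurrence
    (3, "C" * 32, 2001, "t3", "formica again"),           # 1 occurrence
]
con.executemany("INSERT INTO treatments VALUES (?, ?, ?, ?, ?)", rows)
con.executemany(
    "INSERT INTO treatmentsFts(rowid, treatmentTitle, fulltext) VALUES (?, ?, ?)",
    [(r[0], r[3], r[4]) for r in rows],
)
counts = dict(con.execute("""
    SELECT t.journalYear, COUNT(*)
    FROM treatmentsFtvIns ftv
    JOIN treatments t ON ftv.doc = t.id
    WHERE ftv.term = 'formica' AND ftv.col = 'fulltext'
    GROUP BY t.journalYear
    ORDER BY t.journalYear
"""))
print(counts)  # {1998: 3, 2001: 1}
```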

Phase 3: Performance Enhancements

  1. Indexing Strategy:

    CREATE INDEX idx_treatments_year ON treatments(journalYear);
    
    • treatments.id is the INTEGER PRIMARY KEY and therefore the rowid, so JOINs on it need no separate index
    • The journalYear index optimizes GROUP BY and range filters on publication year
  2. Materialized Statistics:

    CREATE TABLE term_year_stats (
      term TEXT COLLATE NOCASE,
      journalYear INTEGER,
      occurrence_count INTEGER,
      last_updated DATETIME,
      PRIMARY KEY (term, journalYear)
    );
    
    • Precompute and refresh using triggers
    • Enable near-instant historical queries
  3. Sorting and Temp Storage:

    • PRAGMA threads lets SQLite use helper threads for large sorts (SQLite does not parallelize query execution itself)
    • PRAGMA temp_store = MEMORY keeps intermediate join and sort results out of disk-based temp files
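The precompute step for term_year_stats can be sketched as a one-shot full refresh (triggers would keep it current incrementally). The refresh query and sample data below are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE treatments (
        id INTEGER PRIMARY KEY, journalYear INTEGER,
        treatmentTitle TEXT, fulltext TEXT
    );
    CREATE VIRTUAL TABLE treatmentsFts USING fts5(
        treatmentTitle, fulltext, content=''
    );
    CREATE VIRTUAL TABLE treatmentsFtvIns
    USING fts5vocab('treatmentsFts', 'instance');
    CREATE TABLE term_year_stats (
        term TEXT COLLATE NOCASE,
        journalYear INTEGER,
        occurrence_count INTEGER,
        last_updated DATETIME,
        PRIMARY KEY (term, journalYear)
    );
""")
rows = [(1, 1998, "t1", "formica formica"), (2, 2001, "t2", "formica atta")]
con.executemany("INSERT INTO treatments VALUES (?, ?, ?, ?)", rows)
con.executemany(
    "INSERT INTO treatmentsFts(rowid, treatmentTitle, fulltext) VALUES (?, ?, ?)",
    [(r[0], r[2], r[3]) for r in rows],
)
# Full refresh: recompute every (term, year) occurrence count in one pass
con.execute("""
    INSERT OR REPLACE INTO term_year_stats
    SELECT ftv.term, t.journalYear, COUNT(*), datetime('now')
    FROM treatmentsFtvIns ftv
    JOIN treatments t ON ftv.doc = t.id
    GROUP BY ftv.term, t.journalYear
""")
stats = {
    (term, year): cnt for term, year, cnt in
    con.execute("SELECT term, journalYear, occurrence_count FROM term_year_stats")
}
print(stats[("formica", 1998)])  # 2
```

Subsequent frequency queries then hit the small precomputed table instead of re-joining the vocabulary table.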

Advanced Implementation Considerations

  1. Tokenizer Configuration:
    Positional data is stored by default (detail=full); the tokenizer decides what counts as a term:

    CREATE VIRTUAL TABLE treatmentsFts USING fts5(
      treatmentTitle,
      fulltext,
      content='',
      tokenize = "porter unicode61 tokenchars '-'"
    );
    
    • tokenchars keeps hyphenated forms such as "wood-ant" as a single token (separators would split them instead)
    • The Porter stemmer normalizes inflected forms, so vocabulary terms are stems ("nests" and "nest" both index as "nest")
    • Changing the tokenizer requires rebuilding the FTS table, since existing index entries were produced by the old one
  2. Frequency Weighting:
    Incorporate document length normalization. Compute the per-year average length in a separate CTE; averaging LENGTH(t.fulltext) over the joined instance rows would weight each document by its own occurrence count:

    WITH year_lengths AS (
      SELECT journalYear, AVG(LENGTH(fulltext)) AS avg_len
      FROM treatments
      GROUP BY journalYear
    )
    SELECT 
      t.journalYear,
      COUNT(*) AS raw_count,
      COUNT(*) * 1.0 / yl.avg_len AS normalized_count
    FROM treatmentsFtvIns ftv
    JOIN treatments t ON ftv.doc = t.id
    JOIN year_lengths yl ON yl.journalYear = t.journalYear
    WHERE ftv.term = 'formica'
    GROUP BY t.journalYear;
    
    • Mitigates bias towards longer documents
    • Enables fair cross-year comparisons
  3. Temporal Smoothing:
    Apply window functions for moving averages:

    WITH yearly_counts AS (
      SELECT 
        journalYear,
        COUNT(doc) AS cnt
      FROM treatmentsFtvIns
      JOIN treatments ON doc = id
      WHERE term = 'formica'
      GROUP BY journalYear
    )
    SELECT
      journalYear,
      AVG(cnt) OVER (
        ORDER BY journalYear
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
      ) AS smoothed_count
    FROM yearly_counts;
    
    • Reduces noise in sparse year data
    • Provides trend visualization stability
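The moving-average step operates on any (year, count) relation, so it can be demonstrated on its own with a small hand-built table; the counts below are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE yearly_counts (journalYear INTEGER, cnt INTEGER)")
con.executemany(
    "INSERT INTO yearly_counts VALUES (?, ?)",
    [(2000, 1), (2001, 2), (2002, 9), (2003, 2), (2004, 1)],
)
# 5-year centered moving average (window shrinks at the edges)
smoothed = list(con.execute("""
    SELECT journalYear,
           AVG(cnt) OVER (
               ORDER BY journalYear
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
           ) AS smoothed_count
    FROM yearly_counts
    ORDER BY journalYear
"""))
for year, avg in smoothed:
    print(year, avg)
```

The 2002 spike of 9 is averaged against its neighbors down to 3.0, while edge years use whatever part of the window exists. Window functions require SQLite 3.25 or later.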

Diagnostic Techniques for Common Issues

  1. Term Mismatch Verification:

    SELECT DISTINCT term FROM treatmentsFtvIns
    WHERE term LIKE '%formica%';
    
    • Identifies spelling variations and stemmed forms (LIKE is case-insensitive for ASCII by default, so no COLLATE clause is needed)
  2. Document Alignment Check:

    SELECT COUNT(*) FROM (
      SELECT doc FROM treatmentsFtvIns
      EXCEPT
      SELECT id FROM treatments
    );
    
    • Detects orphaned FTS entries
    • Ensures contentless FTS table consistency
  3. Performance Profiling:

    EXPLAIN QUERY PLAN
    SELECT t.journalYear, COUNT(ftv.doc)
    FROM treatmentsFtvIns ftv
    JOIN treatments t ON ftv.doc = t.id
    WHERE ftv.term = 'formica'
    GROUP BY t.journalYear;
    
    • Analyze query execution steps
    • Verify index utilization
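The document alignment check can be exercised by deliberately indexing a document with no base-table row; the orphan shows up immediately. Schema and data here are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE treatments (id INTEGER PRIMARY KEY, fulltext TEXT);
    CREATE VIRTUAL TABLE treatmentsFts USING fts5(fulltext, content='');
    CREATE VIRTUAL TABLE treatmentsFtvIns
    USING fts5vocab('treatmentsFts', 'instance');
""")
con.execute("INSERT INTO treatments VALUES (1, 'formica')")
con.execute("INSERT INTO treatmentsFts(rowid, fulltext) VALUES (1, 'formica')")
# Orphan: indexed under rowid 2 with no matching treatments row
con.execute("INSERT INTO treatmentsFts(rowid, fulltext) VALUES (2, 'formica')")
orphans = con.execute("""
    SELECT COUNT(*) FROM (
        SELECT doc FROM treatmentsFtvIns
        EXCEPT
        SELECT id FROM treatments
    )
""").fetchone()[0]
print(orphans)  # 1
```

A nonzero result means the contentless index and the base table have drifted apart, typically from deletes or inserts that bypassed one of the two.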

Alternative Approaches and Tradeoffs

  1. Trigger-Based Counting:

    CREATE TRIGGER trg_term_insert AFTER INSERT ON treatments
    BEGIN
      INSERT INTO term_year_stats (term, journalYear, occurrence_count)
      SELECT 
        value,
        NEW.journalYear,
        (LENGTH(NEW.fulltext) - LENGTH(REPLACE(LOWER(NEW.fulltext), value, '')))
          / LENGTH(value)
      FROM json_each(
        json_array(
          'formica', 'camponotus', 'atta'  /* Predefined term list */
        )
      )
      WHERE true  /* required to disambiguate the upsert clause after a SELECT */
      ON CONFLICT(term, journalYear) DO UPDATE SET
        occurrence_count = occurrence_count + excluded.occurrence_count;
    END;
    
    • Pros: Real-time updates, fixed cost per insert
    • Cons: Requires a predefined term list; counts raw substrings rather than tokenized terms, so results can diverge from the FTS index
  2. Hybrid Materialized Views:
    SQLite has no native materialized views; define an ordinary view and snapshot it into a table whenever a refresh is needed:

    CREATE VIEW aggregated_term_years AS
    SELECT 
      ftv.term,
      t.journalYear,
      COUNT(ftv.doc) AS cnt,
      GROUP_CONCAT(t.treatmentId, '|') AS treatment_ids
    FROM treatmentsFtvIns ftv
    JOIN treatments t ON ftv.doc = t.id
    GROUP BY ftv.term, t.journalYear;
    
    • Pros: Combines counts with document identifiers
    • Cons: The view recomputes on every read; snapshotting it into a table adds refresh logic and storage overhead
  3. Probabilistic Data Structures:

    #include <sqlite3ext.h>
    SQLITE_EXTENSION_INIT1
    
    static void hll_step(
      sqlite3_context *context,
      int argc,
      sqlite3_value **argv
    ) { /* HyperLogLog accumulator update */ }
    
    static void hll_final(sqlite3_context *context) {
      /* Emit the cardinality estimate */
    }
    
    /* Register extension functions */
    int sqlite3_extension_init(
      sqlite3 *db, 
      char **pzErrMsg, 
      const sqlite3_api_routines *pApi
    ) {
      SQLITE_EXTENSION_INIT2(pApi);
      /* Aggregates register xStep and xFinal; xFunc stays NULL */
      sqlite3_create_function(db, "hll_count", 1, SQLITE_UTF8, 0,
                              0, hll_step, hll_final);
      return SQLITE_OK;
    }
    
    • Pros: Approximate counts with minimal memory
    • Cons: Requires custom extensions, inexact results
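The replace-and-divide counting trick used by the trigger is easy to verify in isolation. Note the parentheses: the full length difference must be grouped before the division, otherwise operator precedence divides only the second LENGTH. The sample sentence is invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Count case-insensitive substring occurrences purely in SQL
count = con.execute(
    """
    SELECT (LENGTH(:txt) - LENGTH(REPLACE(LOWER(:txt), :term, '')))
           / LENGTH(:term)
    """,
    {"txt": "Formica here, formica there, and one more formica.",
     "term": "formica"},
).fetchone()[0]
print(count)  # 3
```

Removing every copy of the term shortens the string by count × term-length, so dividing the difference by the term length recovers the count.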

Cross-Platform Optimization Patterns

  1. Columnar Storage Techniques:

    CREATE TABLE year_term_matrix (
      journalYear INTEGER PRIMARY KEY,
      formica INTEGER DEFAULT 0,
      camponotus INTEGER DEFAULT 0,
      atta INTEGER DEFAULT 0
    );
    
    • Enables vectorized operations on term counts
    • Efficient for fixed taxonomy analyses
  2. Sparse Temporal Indexing:

    CREATE VIRTUAL TABLE year_spans USING rtree(
      id,                     /* required integer key */
      year_start, year_end,   /* one coordinate pair */
      +doc_ids                /* auxiliary column: compressed document ID list */
    );
    
    • Accelerates range queries over years
    • Compact storage for temporal document sets; auxiliary '+' columns require SQLite 3.24+
  3. In-Memory Acceleration:

    ATTACH DATABASE ':memory:' AS mem;
    CREATE TABLE mem.term_cache AS
    SELECT term, doc, col, offset FROM treatmentsFtvIns;
    
    • Bypass virtual table I/O overhead
    • Volatile but fast for session-based analysis
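The in-memory caching step can be sketched as follows. In a real deployment the main database would be disk-backed and the ATTACH would pull vocabulary rows into volatile storage once per session; here both sides are in-memory purely for the demonstration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE VIRTUAL TABLE treatmentsFts USING fts5(fulltext, content='');
    CREATE VIRTUAL TABLE treatmentsFtvIns
    USING fts5vocab('treatmentsFts', 'instance');
""")
con.execute(
    "INSERT INTO treatmentsFts(rowid, fulltext) VALUES (1, 'formica atta formica')"
)
# Copy vocabulary rows into a separate in-memory schema once per session
con.executescript("""
    ATTACH DATABASE ':memory:' AS mem;
    CREATE TABLE mem.term_cache AS
    SELECT term, doc, col, offset FROM treatmentsFtvIns;
""")
cached = con.execute("SELECT COUNT(*) FROM mem.term_cache").fetchone()[0]
print(cached)  # 3 term instances: formica x2, atta x1
```

Subsequent analytical queries then scan the plain mem.term_cache table instead of paying the virtual-table dispatch cost on every row.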

Long-Term Maintenance Considerations

  1. Vocabulary Table Replication:

    CREATE TABLE ftvins_snapshot (
      term TEXT,
      doc INTEGER,
      col TEXT,
      offset INTEGER,
      snapshot_date DATETIME
    );
    
    INSERT INTO ftvins_snapshot
    SELECT *, datetime('now') FROM treatmentsFtvIns;
    
    • Enables historical trend comparisons
    • Requires periodic archiving strategy
  2. Query Plan Analysis:

    /* Requires a build with the SQLITE_STMT virtual table enabled */
    SELECT sql, nscan, nsort, nstep
    FROM sqlite_stmt
    WHERE sql LIKE '%treatmentsFtvIns%';
    
    • Monitor per-statement scan and sort counters for index usage efficiency
    • Pair with EXPLAIN QUERY PLAN to detect plan regressions after schema changes
  3. Compression Strategies:

    PRAGMA auto_vacuum = INCREMENTAL;  /* new databases only, or run VACUUM after */
    PRAGMA incremental_vacuum;
    
    • Manage database growth from FTS auxiliary tables
    • Balance storage footprint with query performance

Final Implementation Checklist

  1. Schema Validation:

    • Verify FTS5 table uses contentless mode properly
    • Confirm treatmentId UNIQUE constraint functionality
    • Check collation sequence consistency across joins
  2. Index Coverage Audit:

    • Ensure existence of treatments.id index
    • Validate journalYear index selectivity
    • Consider composite indexes on frequent query patterns
  3. Performance Benchmarking:

    • Establish baseline query times for sample terms
    • Measure memory usage during large aggregations
    • Profile disk I/O patterns during joins
  4. Data Quality Assurance:

    • Reconcile FTS document counts with base table
    • Validate year ranges against known publication dates
    • Spot-check term offsets in sample documents

This comprehensive approach enables robust temporal linguistic analysis within SQLite’s constraints, balancing query flexibility with performance considerations. The solution scales through strategic indexing, materialized aggregates, and careful schema design while maintaining direct access to source documents for drill-down capabilities.
