Calculating Word Frequency by Year Using SQLite FTS5 and Vocabulary Tables

Understanding Term Frequency Distribution Across Years with FTS5Vocab

Architectural Context: Treatment Records and Full-Text Search

The core challenge involves analyzing the temporal distribution of specific words or phrases within a corpus of treatment records stored in SQLite. The schema comprises two critical components:

  1. Physical Table (treatments):

    • Stores metadata including unique treatment identifiers, titles, journal years, and full-text content
    • Critical fields:
      • id (INTEGER PRIMARY KEY): Internal row identifier
      • treatmentId (TEXT): 32-character unique external identifier
      • journalYear (INTEGER): Publication year for temporal analysis
      • fulltext (TEXT): Unstructured content for linguistic analysis
  2. Virtual Table (treatmentsFts):

    • FTS5-powered search index covering both titles and fulltext
    • Case-insensitive matching via the tokenizer's Unicode case folding (FTS5 match behavior is governed by the tokenizer, not SQL collations)
    • Employs contentless design (content='') to avoid data duplication

The operational requirement demands correlating term occurrence patterns with publication years while maintaining direct access to source documents through treatment identifiers. This necessitates bridging SQLite’s full-text search capabilities with traditional relational query patterns.
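The two-table layout described above can be sketched end to end with Python's bundled sqlite3 module. Table and column names follow the schema discussed here; the sample rows are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE treatments (
        id             INTEGER PRIMARY KEY,
        treatmentId    TEXT UNIQUE,   /* 32-character external identifier */
        journalYear    INTEGER,
        treatmentTitle TEXT,
        fulltext       TEXT
    );
    /* Contentless FTS5 index over title and fulltext */
    CREATE VIRTUAL TABLE treatmentsFts USING fts5(
        treatmentTitle, fulltext, content=''
    );
""")
rows = [
    (1, "A" * 32, 1998, "Formica notes",  "formica nests in rotten wood"),
    (2, "B" * 32, 1998, "Atta survey",    "atta colonies, with some formica"),
    (3, "C" * 32, 2001, "Camponotus key", "camponotus workers only"),
]
con.executemany("INSERT INTO treatments VALUES (?, ?, ?, ?, ?)", rows)
# Contentless tables store no column values, so the rowid must be set
# explicitly to mirror treatments.id for later joins.
con.executemany(
    "INSERT INTO treatmentsFts(rowid, treatmentTitle, fulltext) VALUES (?, ?, ?)",
    [(r[0], r[3], r[4]) for r in rows],
)
hits = sorted(r[0] for r in con.execute(
    "SELECT rowid FROM treatmentsFts WHERE treatmentsFts MATCH 'formica'"))
print(hits)  # [1, 2]
```

The explicit rowid on insert is the linchpin: it is the only link a contentless FTS table retains back to the source row.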

Challenges in Temporal Frequency Analysis

Three fundamental obstacles emerge when attempting year-based frequency calculations:

  1. FTS5 Statistics Granularity:

    • Built-in auxiliary functions like bm25() focus on document relevance scoring
    • Inverse Document Frequency (IDF) metrics operate at corpus level without temporal context
    • No native support for grouping search statistics by external metadata columns
  2. Vocabulary Table Limitations:

    • Row-type FTS5Vocab tables provide global term frequencies
    • Column-type variants break down frequencies per column but lack document context
    • Neither directly connects to publication year metadata
  3. Join Operation Complexity:

    • Relating virtual table statistics to physical records requires precise key mapping
    • Document identifier reconciliation between FTS5 tables and source data must handle:
      • Contentless FTS table configurations
      • Potential identifier type mismatches (INTEGER vs TEXT)
      • Index coverage for efficient large-scale joins

Comprehensive Solution Framework

Phase 1: Vocabulary Table Configuration

Implement instance-type FTS5Vocab table for granular term tracking:

CREATE VIRTUAL TABLE treatmentsFtvIns 
USING fts5vocab('treatmentsFts', 'instance');

Key characteristics:

  • Stores individual term occurrences with document IDs
  • Records column location (treatmentTitle vs fulltext)
  • Captures positional offsets within documents
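A quick way to see all three characteristics is to index a single document and dump the instance table. The document text below is invented; note that the stored term is already case-folded by the default unicode61 tokenizer:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE VIRTUAL TABLE treatmentsFts USING fts5(
        treatmentTitle, fulltext, content=''
    );
    /* One row per individual term occurrence */
    CREATE VIRTUAL TABLE treatmentsFtvIns
    USING fts5vocab('treatmentsFts', 'instance');
""")
con.execute(
    "INSERT INTO treatmentsFts(rowid, treatmentTitle, fulltext) "
    "VALUES (1, 'Formica notes', 'formica near formica mounds')"
)
rows = list(con.execute(
    "SELECT term, doc, col, offset FROM treatmentsFtvIns "
    "WHERE term = 'formica' ORDER BY col, offset"
))
for term, doc, col, offset in rows:
    print(term, doc, col, offset)
```

Three rows come back: one occurrence in treatmentTitle (offset 0) and two in fulltext (offsets 0 and 2), each tagged with the document rowid.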

Phase 2: Temporal Aggregation Query

Join vocabulary data with source metadata for year-based counts:

SELECT t.journalYear, COUNT(*) AS occurrence_count
FROM treatmentsFtvIns ftv
JOIN treatments t ON ftv.doc = t.id
WHERE ftv.term = 'formica'
  AND ftv.col = 'fulltext'
GROUP BY t.journalYear
ORDER BY t.journalYear ASC;

Critical details:

  • Vocabulary terms are stored as emitted by the tokenizer (case-folded by unicode61), so match against the lowercase form; a COLLATE NOCASE comparison would only cover ASCII anyway
  • The column filter restricts counts to fulltext occurrences only
  • COUNT(*) counts term occurrences, since each instance-table row is one occurrence; use COUNT(DISTINCT ftv.doc) to count matching documents instead
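Putting the pieces together on sample data makes the counting semantics concrete: one instance-table row per occurrence, so a document containing the term twice contributes two to its year. The data below is invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE treatments (
        id INTEGER PRIMARY KEY, treatmentId TEXT UNIQUE,
        journalYear INTEGER, treatmentTitle TEXT, fulltext TEXT
    );
    CREATE VIRTUAL TABLE treatmentsFts USING fts5(
        treatmentTitle, fulltext, content=''
    );
    CREATE VIRTUAL TABLE treatmentsFtvIns
    USING fts5vocab('treatmentsFts', 'instance');
""")
rows = [
    (1, "A" * 32, 1998, "t1", "formica formica mounds"),  # 2 occurrences
    (2, "B" * 32, 1998, "t2", "one formica here"),        # 1 occurrence
    (3, "C" * 32, 2001, "t3", "formica again"),           # 1 occurrence
]
con.executemany("INSERT INTO treatments VALUES (?, ?, ?, ?, ?)", rows)
con.executemany(
    "INSERT INTO treatmentsFts(rowid, treatmentTitle, fulltext) VALUES (?, ?, ?)",
    [(r[0], r[3], r[4]) for r in rows],
)
counts = dict(con.execute("""
    SELECT t.journalYear, COUNT(*)
    FROM treatmentsFtvIns ftv
    JOIN treatments t ON ftv.doc = t.id
    WHERE ftv.term = 'formica' AND ftv.col = 'fulltext'
    GROUP BY t.journalYear
    ORDER BY t.journalYear
"""))
print(counts)  # {1998: 3, 2001: 1}
```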

Phase 3: Performance Enhancements

  1. Indexing Strategy:

    CREATE INDEX idx_treatments_year ON treatments(journalYear);
    
    • treatments.id is the INTEGER PRIMARY KEY and therefore the rowid, so JOINs on it need no separate index
    • The journalYear index optimizes GROUP BY and range filters on publication year
  2. Materialized Statistics:

    CREATE TABLE term_year_stats (
      term TEXT COLLATE NOCASE,
      journalYear INTEGER,
      occurrence_count INTEGER,
      last_updated DATETIME,
      PRIMARY KEY (term, journalYear)
    );
    
    • Precompute and refresh using triggers
    • Enable near-instant historical queries
  3. Sorting and Temp Storage:

    • PRAGMA threads lets SQLite use helper threads for large sorts (SQLite does not parallelize query execution itself)
    • PRAGMA temp_store = MEMORY keeps intermediate join and sort results out of disk-based temp files
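The precompute step for term_year_stats can be sketched as a one-shot full refresh (triggers would keep it current incrementally). The refresh query and sample data below are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE treatments (
        id INTEGER PRIMARY KEY, journalYear INTEGER,
        treatmentTitle TEXT, fulltext TEXT
    );
    CREATE VIRTUAL TABLE treatmentsFts USING fts5(
        treatmentTitle, fulltext, content=''
    );
    CREATE VIRTUAL TABLE treatmentsFtvIns
    USING fts5vocab('treatmentsFts', 'instance');
    CREATE TABLE term_year_stats (
        term TEXT COLLATE NOCASE,
        journalYear INTEGER,
        occurrence_count INTEGER,
        last_updated DATETIME,
        PRIMARY KEY (term, journalYear)
    );
""")
rows = [(1, 1998, "t1", "formica formica"), (2, 2001, "t2", "formica atta")]
con.executemany("INSERT INTO treatments VALUES (?, ?, ?, ?)", rows)
con.executemany(
    "INSERT INTO treatmentsFts(rowid, treatmentTitle, fulltext) VALUES (?, ?, ?)",
    [(r[0], r[2], r[3]) for r in rows],
)
# Full refresh: recompute every (term, year) occurrence count in one pass
con.execute("""
    INSERT OR REPLACE INTO term_year_stats
    SELECT ftv.term, t.journalYear, COUNT(*), datetime('now')
    FROM treatmentsFtvIns ftv
    JOIN treatments t ON ftv.doc = t.id
    GROUP BY ftv.term, t.journalYear
""")
stats = {
    (term, year): cnt for term, year, cnt in
    con.execute("SELECT term, journalYear, occurrence_count FROM term_year_stats")
}
print(stats[("formica", 1998)])  # 2
```

Subsequent frequency queries then hit the small precomputed table instead of re-joining the vocabulary table.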

Advanced Implementation Considerations

  1. Tokenizer Configuration:
    Positional data is stored by default (detail=full); the tokenizer decides what counts as a term:

    CREATE VIRTUAL TABLE treatmentsFts USING fts5(
      treatmentTitle,
      fulltext,
      content='',
      tokenize = "porter unicode61 tokenchars '-'"
    );
    
    • tokenchars keeps hyphenated forms such as "wood-ant" as a single token (separators would split them instead)
    • The Porter stemmer normalizes inflected forms, so vocabulary terms are stems ("nests" and "nest" both index as "nest")
    • Changing the tokenizer requires rebuilding the FTS table, since existing index entries were produced by the old one
  2. Frequency Weighting:
    Incorporate document length normalization. Compute the per-year average length in a separate CTE; averaging LENGTH(t.fulltext) over the joined instance rows would weight each document by its own occurrence count:

    WITH year_lengths AS (
      SELECT journalYear, AVG(LENGTH(fulltext)) AS avg_len
      FROM treatments
      GROUP BY journalYear
    )
    SELECT 
      t.journalYear,
      COUNT(*) AS raw_count,
      COUNT(*) * 1.0 / yl.avg_len AS normalized_count
    FROM treatmentsFtvIns ftv
    JOIN treatments t ON ftv.doc = t.id
    JOIN year_lengths yl ON yl.journalYear = t.journalYear
    WHERE ftv.term = 'formica'
    GROUP BY t.journalYear;
    
    • Mitigates bias towards longer documents
    • Enables fair cross-year comparisons
  3. Temporal Smoothing:
    Apply window functions for moving averages:

    WITH yearly_counts AS (
      SELECT 
        journalYear,
        COUNT(doc) AS cnt
      FROM treatmentsFtvIns
      JOIN treatments ON doc = id
      WHERE term = 'formica'
      GROUP BY journalYear
    )
    SELECT
      journalYear,
      AVG(cnt) OVER (
        ORDER BY journalYear
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
      ) AS smoothed_count
    FROM yearly_counts;
    
    • Reduces noise in sparse year data
    • Provides trend visualization stability
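The moving-average step operates on any (year, count) relation, so it can be demonstrated on its own with a small hand-built table; the counts below are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE yearly_counts (journalYear INTEGER, cnt INTEGER)")
con.executemany(
    "INSERT INTO yearly_counts VALUES (?, ?)",
    [(2000, 1), (2001, 2), (2002, 9), (2003, 2), (2004, 1)],
)
# 5-year centered moving average (window shrinks at the edges)
smoothed = list(con.execute("""
    SELECT journalYear,
           AVG(cnt) OVER (
               ORDER BY journalYear
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
           ) AS smoothed_count
    FROM yearly_counts
    ORDER BY journalYear
"""))
for year, avg in smoothed:
    print(year, avg)
```

The 2002 spike of 9 is averaged against its neighbors down to 3.0, while edge years use whatever part of the window exists. Window functions require SQLite 3.25 or later.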

Diagnostic Techniques for Common Issues

  1. Term Mismatch Verification:

    SELECT DISTINCT term FROM treatmentsFtvIns
    WHERE term LIKE '%formica%';
    
    • Identifies spelling variations and stemmed forms (LIKE is case-insensitive for ASCII by default, so no COLLATE clause is needed)
  2. Document Alignment Check:

    SELECT COUNT(*) FROM (
      SELECT doc FROM treatmentsFtvIns
      EXCEPT
      SELECT id FROM treatments
    );
    
    • Detects orphaned FTS entries
    • Ensures contentless FTS table consistency
  3. Performance Profiling:

    EXPLAIN QUERY PLAN
    SELECT t.journalYear, COUNT(ftv.doc)
    FROM treatmentsFtvIns ftv
    JOIN treatments t ON ftv.doc = t.id
    WHERE ftv.term = 'formica'
    GROUP BY t.journalYear;
    
    • Analyze query execution steps
    • Verify index utilization
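The document alignment check can be exercised by deliberately indexing a document with no base-table row; the orphan shows up immediately. Schema and data here are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE treatments (id INTEGER PRIMARY KEY, fulltext TEXT);
    CREATE VIRTUAL TABLE treatmentsFts USING fts5(fulltext, content='');
    CREATE VIRTUAL TABLE treatmentsFtvIns
    USING fts5vocab('treatmentsFts', 'instance');
""")
con.execute("INSERT INTO treatments VALUES (1, 'formica')")
con.execute("INSERT INTO treatmentsFts(rowid, fulltext) VALUES (1, 'formica')")
# Orphan: indexed under rowid 2 with no matching treatments row
con.execute("INSERT INTO treatmentsFts(rowid, fulltext) VALUES (2, 'formica')")
orphans = con.execute("""
    SELECT COUNT(*) FROM (
        SELECT doc FROM treatmentsFtvIns
        EXCEPT
        SELECT id FROM treatments
    )
""").fetchone()[0]
print(orphans)  # 1
```

A nonzero result means the contentless index and the base table have drifted apart, typically from deletes or inserts that bypassed one of the two.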

Alternative Approaches and Tradeoffs

  1. Trigger-Based Counting:

    CREATE TRIGGER trg_term_insert AFTER INSERT ON treatments
    BEGIN
      INSERT INTO term_year_stats (term, journalYear, occurrence_count)
      SELECT 
        value,
        NEW.journalYear,
        (LENGTH(NEW.fulltext) - LENGTH(REPLACE(LOWER(NEW.fulltext), value, '')))
          / LENGTH(value)
      FROM json_each(
        json_array(
          'formica', 'camponotus', 'atta'  /* Predefined term list */
        )
      )
      WHERE true  /* required to disambiguate the upsert clause after a SELECT */
      ON CONFLICT(term, journalYear) DO UPDATE SET
        occurrence_count = occurrence_count + excluded.occurrence_count;
    END;
    
    • Pros: Real-time updates, fixed cost per insert
    • Cons: Requires a predefined term list; counts raw substrings rather than tokenized terms, so results can diverge from the FTS index
  2. Hybrid Materialized Views:
    SQLite has no native materialized views; define an ordinary view and snapshot it into a table whenever a refresh is needed:

    CREATE VIEW aggregated_term_years AS
    SELECT 
      ftv.term,
      t.journalYear,
      COUNT(ftv.doc) AS cnt,
      GROUP_CONCAT(t.treatmentId, '|') AS treatment_ids
    FROM treatmentsFtvIns ftv
    JOIN treatments t ON ftv.doc = t.id
    GROUP BY ftv.term, t.journalYear;
    
    • Pros: Combines counts with document identifiers
    • Cons: The view recomputes on every read; snapshotting it into a table adds refresh logic and storage overhead
  3. Probabilistic Data Structures:

    #include <sqlite3ext.h>
    SQLITE_EXTENSION_INIT1
    
    static void hll_step(
      sqlite3_context *context,
      int argc,
      sqlite3_value **argv
    ) { /* HyperLogLog accumulator update */ }
    
    static void hll_final(sqlite3_context *context) {
      /* Emit the cardinality estimate */
    }
    
    /* Register extension functions */
    int sqlite3_extension_init(
      sqlite3 *db, 
      char **pzErrMsg, 
      const sqlite3_api_routines *pApi
    ) {
      SQLITE_EXTENSION_INIT2(pApi);
      /* Aggregates register xStep and xFinal; xFunc stays NULL */
      sqlite3_create_function(db, "hll_count", 1, SQLITE_UTF8, 0,
                              0, hll_step, hll_final);
      return SQLITE_OK;
    }
    
    • Pros: Approximate counts with minimal memory
    • Cons: Requires custom extensions, inexact results
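The replace-and-divide counting trick used by the trigger is easy to verify in isolation. Note the parentheses: the full length difference must be grouped before the division, otherwise operator precedence divides only the second LENGTH. The sample sentence is invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Count case-insensitive substring occurrences purely in SQL
count = con.execute(
    """
    SELECT (LENGTH(:txt) - LENGTH(REPLACE(LOWER(:txt), :term, '')))
           / LENGTH(:term)
    """,
    {"txt": "Formica here, formica there, and one more formica.",
     "term": "formica"},
).fetchone()[0]
print(count)  # 3
```

Removing every copy of the term shortens the string by count × term-length, so dividing the difference by the term length recovers the count.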

Cross-Platform Optimization Patterns

  1. Columnar Storage Techniques:

    CREATE TABLE year_term_matrix (
      journalYear INTEGER PRIMARY KEY,
      formica INTEGER DEFAULT 0,
      camponotus INTEGER DEFAULT 0,
      atta INTEGER DEFAULT 0
    );
    
    • Enables vectorized operations on term counts
    • Efficient for fixed taxonomy analyses
  2. Sparse Temporal Indexing:

    CREATE VIRTUAL TABLE year_spans USING rtree(
      id,                     /* required integer key */
      year_start, year_end,   /* one coordinate pair */
      +doc_ids                /* auxiliary column: compressed document ID list */
    );
    
    • Accelerates range queries over years
    • Compact storage for temporal document sets; auxiliary '+' columns require SQLite 3.24+
  3. In-Memory Acceleration:

    ATTACH DATABASE ':memory:' AS mem;
    CREATE TABLE mem.term_cache AS
    SELECT term, doc, col, offset FROM treatmentsFtvIns;
    
    • Bypass virtual table I/O overhead
    • Volatile but fast for session-based analysis
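The in-memory caching step can be sketched as follows. In a real deployment the main database would be disk-backed and the ATTACH would pull vocabulary rows into volatile storage once per session; here both sides are in-memory purely for the demonstration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE VIRTUAL TABLE treatmentsFts USING fts5(fulltext, content='');
    CREATE VIRTUAL TABLE treatmentsFtvIns
    USING fts5vocab('treatmentsFts', 'instance');
""")
con.execute(
    "INSERT INTO treatmentsFts(rowid, fulltext) VALUES (1, 'formica atta formica')"
)
# Copy vocabulary rows into a separate in-memory schema once per session
con.executescript("""
    ATTACH DATABASE ':memory:' AS mem;
    CREATE TABLE mem.term_cache AS
    SELECT term, doc, col, offset FROM treatmentsFtvIns;
""")
cached = con.execute("SELECT COUNT(*) FROM mem.term_cache").fetchone()[0]
print(cached)  # 3 term instances: formica x2, atta x1
```

Subsequent analytical queries then scan the plain mem.term_cache table instead of paying the virtual-table dispatch cost on every row.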

Long-Term Maintenance Considerations

  1. Vocabulary Table Replication:

    CREATE TABLE ftvins_snapshot (
      term TEXT,
      doc INTEGER,
      col TEXT,
      offset INTEGER,
      snapshot_date DATETIME
    );
    
    INSERT INTO ftvins_snapshot
    SELECT *, datetime('now') FROM treatmentsFtvIns;
    
    • Enables historical trend comparisons
    • Requires periodic archiving strategy
  2. Query Plan Analysis:

    /* Requires a build with the SQLITE_STMT virtual table enabled */
    SELECT sql, nscan, nsort, nstep
    FROM sqlite_stmt
    WHERE sql LIKE '%treatmentsFtvIns%';
    
    • Monitor per-statement scan and sort counters for index usage efficiency
    • Pair with EXPLAIN QUERY PLAN to detect plan regressions after schema changes
  3. Compression Strategies:

    PRAGMA auto_vacuum = INCREMENTAL;  /* new databases only, or run VACUUM after */
    PRAGMA incremental_vacuum;
    
    • Manage database growth from FTS auxiliary tables
    • Balance storage footprint with query performance

Final Implementation Checklist

  1. Schema Validation:

    • Verify FTS5 table uses contentless mode properly
    • Confirm treatmentId UNIQUE constraint functionality
    • Check collation sequence consistency across joins
  2. Index Coverage Audit:

    • Ensure existence of treatments.id index
    • Validate journalYear index selectivity
    • Consider composite indexes on frequent query patterns
  3. Performance Benchmarking:

    • Establish baseline query times for sample terms
    • Measure memory usage during large aggregations
    • Profile disk I/O patterns during joins
  4. Data Quality Assurance:

    • Reconcile FTS document counts with base table
    • Validate year ranges against known publication dates
    • Spot-check term offsets in sample documents

This comprehensive approach enables robust temporal linguistic analysis within SQLite’s constraints, balancing query flexibility with performance considerations. The solution scales through strategic indexing, materialized aggregates, and careful schema design while maintaining direct access to source documents for drill-down capabilities.
