Using FTS5 with Metadata: UNINDEXED Columns vs. External Content Tables

FTS5 Indexing Strategies for Combining Searchable Content and Non-Indexed Metadata

Core Challenge: Selective Column Indexing in Full-Text Search Tables

The central technical challenge revolves around designing an SQLite FTS5 virtual table that indexes specific text content (e.g., message bodies) while excluding associated metadata (e.g., timestamps, author names) from the full-text search index. This requires balancing three objectives:

  1. Efficient full-text query performance on targeted columns
  2. Retention of non-searchable metadata within the same logical row structure
  3. Prevention of accidental metadata inclusion in FTS5 tokenization workflows

Two primary SQLite mechanisms address this: UNINDEXED column declarations and external content tables. Each approach modifies how FTS5 stores, indexes, and retrieves data, with implications for schema design, query patterns, and maintenance overhead.

Anatomy of UNINDEXED Columns

When defining an FTS5 table, appending UNINDEXED to a column definition excludes it from tokenization while retaining it as a standard SQL column. For example:

CREATE VIRTUAL TABLE messages USING fts5(
  message_body, 
  channel UNINDEXED, 
  post_date UNINDEXED, 
  author UNINDEXED
);

The message_body column undergoes FTS5 tokenization and appears in the internal messages_idx index structure. The channel, post_date, and author columns are stored as raw values in the underlying messages_content shadow table but bypass all text processing. Queries using MATCH operate exclusively on message_body, while standard SQL operators (=, LIKE) can filter UNINDEXED columns.

External Content Table Paradigm

External content tables decouple the FTS5 index from the source data storage:

-- Base table holding all data
CREATE TABLE messages_data(
  id INTEGER PRIMARY KEY,
  message_body TEXT,
  channel TEXT,
  post_date DATETIME,
  author TEXT
);

-- FTS5 table referencing external content
CREATE VIRTUAL TABLE messages_fts USING fts5(
  message_body, 
  content='messages_data',
  content_rowid='id'
);

Here, messages_fts contains only the tokenized message_body data. The content parameter links it to messages_data, with content_rowid aligning the FTS5 rowid to the base table’s primary key. The FTS5 table becomes a lightweight index that must be synchronized with the base table through triggers or application logic.

Pitfalls in Metadata Handling Across Indexing Strategies

Both approaches introduce potential failure modes that manifest during schema changes, query execution, or data synchronization:

UNINDEXED Column Misconfiguration

  • Omitting UNINDEXED qualifiers causes metadata columns to undergo tokenization, bloating the FTS5 index and enabling unintended MATCH operations on non-content fields.
  • Attempting to use MATCH on UNINDEXED columns yields silent failures (no results) since they’re absent from the FTS5 index.

External Content Synchronization Failures

  • Direct inserts/updates to the FTS5 table without corresponding base table modifications create phantom rows visible only to full-text queries.
  • Base table changes not propagated to the FTS5 table cause search results to reference stale or nonexistent content.

Schema Drift Vulnerabilities

  • Adding or modifying columns in the base table without updating associated FTS5 triggers or queries breaks referential integrity.
  • Altering UNINDEXED column types (e.g., changing post_date from DATETIME to INTEGER) in FTS5 tables requires full table rebuilds due to SQLite’s ALTER TABLE limitations.

Resolution Framework: Schema Design and Query Optimization

Addressing these issues requires methodical validation of schema definitions, trigger logic (for external content), and query patterns.

Step 1: Validate UNINDEXED Column Usage

For schemas using UNINDEXED columns, confirm that:

  • All non-content columns are explicitly marked UNINDEXED
  • The FTS5 table’s CREATE VIRTUAL TABLE statement doesn’t include tokenization options (e.g., tokenize=porter) for UNINDEXED columns, as they’re ignored

Diagnostic Query

SELECT 
  name, 
  fts5(:tableName, name) AS indexed 
FROM pragma_table_info(:tableName);

Replace :tableName with your FTS5 table name. The indexed column returns 1 for indexed (tokenized) columns and 0 for UNINDEXED.

Step 2: External Content Trigger Auditing

When using external content tables, three triggers must exist to maintain consistency:

  1. AFTER INSERT on base table: Inserts corresponding FTS5 entry
  2. AFTER UPDATE on base table: Updates corresponding FTS5 entry
  3. AFTER DELETE on base table: Deletes corresponding FTS5 entry

Trigger Implementation Example

-- After Insert
CREATE TRIGGER messages_data_ai AFTER INSERT ON messages_data BEGIN
  INSERT INTO messages_fts(rowid, message_body) VALUES (new.id, new.message_body);
END;

-- After Update
CREATE TRIGGER messages_data_au AFTER UPDATE ON messages_data BEGIN
  UPDATE messages_fts SET message_body = new.message_body WHERE rowid = old.id;
END;

-- After Delete
CREATE TRIGGER messages_data_ad AFTER DELETE ON messages_data BEGIN
  DELETE FROM messages_fts WHERE rowid = old.id;
END;

Validation Procedure

  • Confirm trigger existence via sqlite_master query:
    SELECT name, sql FROM sqlite_master 
    WHERE type = 'trigger' AND tbl_name = 'messages_data';
    
  • Test trigger functionality by inserting/updating/deleting base table rows and verifying FTS5 table changes.

Step 3: Query Pattern Enforcement

Ensure application queries adhere to the indexing strategy:

  • UNINDEXED Tables: Use MATCH only on indexed columns; filter others with standard SQL.
    -- Correct
    SELECT * FROM messages 
    WHERE messages MATCH 'search_term' AND author = 'John';
    
    -- Incorrect (MATCH on UNINDEXED column)
    SELECT * FROM messages 
    WHERE author MATCH 'John'; -- Fails silently
    
  • External Content Tables: Join FTS5 and base tables using rowid:
    SELECT m.* 
    FROM messages_data m 
    JOIN messages_fts f ON m.id = f.rowid
    WHERE f.message_body MATCH 'search_term';
    

Step 4: Migration and Schema Change Protocols

Modifying existing FTS5 schemas requires careful planning:

Adding Columns to UNINDEXED FTS5 Tables

  1. Create new FTS5 table with additional UNINDEXED columns
  2. Copy data from old FTS5 table:
    INSERT INTO new_fts SELECT * FROM old_fts;
    
  3. Recreate dependent triggers/views

Altering External Content Schemas

  1. Update base table schema
  2. Adjust FTS5 triggers to handle new/modified columns
  3. Rebuild FTS5 index:
    INSERT INTO messages_fts(messages_fts) VALUES ('rebuild');
    

Performance Optimization Techniques

  • UNINDEXED Column Storage: UNINDEXED columns are stored in the content shadow table, which uses a B-tree structure. Frequent queries filtering on UNINDEXED columns benefit from adding explicit indexes:
    CREATE INDEX messages_channel_idx ON messages(channel);
    
  • External Content Query Planning: Use EXPLAIN QUERY PLAN to verify joins utilize base table indexes:
    EXPLAIN QUERY PLAN
    SELECT m.* FROM messages_data m 
    JOIN messages_fts f ON m.id = f.rowid 
    WHERE f MATCH 'sqlite' AND m.channel = 'general';
    

    Optimal output should show SEARCH messages_fts VIRTUAL TABLE INDEX 0:M followed by SEARCH messages_data USING INTEGER PRIMARY KEY (rowid=?).

Recovery from Common Failure Scenarios

Scenario 1: Accidental Tokenization of Metadata Columns
Symptoms: Full-text queries returning results based on metadata fields; inflated index size.
Resolution:

  1. Create new FTS5 table with proper UNINDEXED designations
  2. Migrate data:
    INSERT INTO new_fts SELECT * FROM old_fts;
    
  3. Drop old table and rename new one

Scenario 2: External Content Table Orphaned Rows
Symptoms: FTS5 search returns entries not present in the base table.
Resolution:

  1. Identify orphans:
    SELECT f.rowid FROM messages_fts f
    LEFT JOIN messages_data m ON f.rowid = m.id
    WHERE m.id IS NULL;
    
  2. Delete orphans:
    DELETE FROM messages_fts WHERE rowid IN (:orphan_ids);
    
  3. Audit triggers for missing logic

Scenario 3: Inconsistent FTS5 Index After Base Table Updates
Symptoms: Search results don’t reflect recent base table changes.
Resolution:

  1. Verify trigger activity using PRAGMA trigger_list;
  2. Manually repopulate the FTS5 index:
    DELETE FROM messages_fts;
    INSERT INTO messages_fts SELECT rowid, message_body FROM messages_data;
    

Advanced Considerations: Hybrid Approaches

For scenarios requiring both UNINDEXED columns and external content benefits (e.g., large text blobs in the base table), combine the two methods:

-- Base table with large text field
CREATE TABLE messages_data(
  id INTEGER PRIMARY KEY,
  message_body TEXT,
  channel TEXT,
  post_date DATETIME,
  author TEXT
);

-- FTS5 table with UNINDEXED metadata
CREATE VIRTUAL TABLE messages_fts USING fts5(
  message_body,
  channel UNINDEXED,
  content='messages_data',
  content_rowid='id'
);

This structure stores message_body in both the base table and FTS5 index (duplicated), while channel is stored as UNINDEXED in FTS5 but also resides in the base table. Trade-offs include increased storage for duplicated columns versus reduced join complexity.

Final Schema Recommendations

  • Use UNINDEXED Columns When:

    • Metadata fields are small (e.g., strings under 255 chars)
    • Queries frequently filter on metadata without requiring joins
    • Simplified schema maintenance is preferred
  • Use External Content Tables When:

    • Base table contains large non-text data (BLOBs, JSON)
    • Existing normalization requires separate table storage
    • FTS5 index size must be minimized

By methodically applying these diagnostics, repair strategies, and design patterns, developers can robustly implement FTS5 tables with non-indexed metadata while avoiding common pitfalls in query behavior, data integrity, and index maintenance.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *