Truncated Words in FTS5Vocab Tables When Using Porter or Snowball Stemmers
Truncated Word Forms in FTS5Vocab Table Queries
The core issue arises when querying an fts5vocab virtual table built on top of an FTS5 full-text search index that employs stemming algorithms like Porter or Snowball. Users observe that specific words inserted into the base table (e.g., "Sample" and "Wallpaper") appear truncated in the fts5vocab output (e.g., "sampl" and "wallpap"). This behavior occurs despite the original words being stored correctly in the underlying content table. The problem becomes critical when integrating the fts5vocab data with auxiliary systems like the Spellfix1 extension, where the truncated stems are misinterpreted as valid correction targets. For example, a spell-checker might suggest "sampl" instead of "sample," undermining user expectations.
The workflow triggering this issue involves four components:
- A base table (e.g., `content`) storing raw text data.
- An FTS5 virtual table (e.g., `content_fts`) configured with a stemming tokenizer.
- An fts5vocab virtual table (e.g., `content_vocab`) mapping vocabulary statistics from the FTS5 index.
- A trigger automating synchronization between the base table and the FTS5 index.
When inserting terms like "sample" or "wallpaper" into the base table, the FTS5 index processes these words through the stemming algorithm before storing their normalized forms. The fts5vocab table then reflects these stemmed versions rather than the original terms. This design is intentional for search optimization but creates confusion when developers expect fts5vocab to mirror the original lexicon. The disconnect becomes evident in scenarios where vocabulary tables feed into spell-checking systems expecting complete, unmodified words.
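The truncation is easy to reproduce in a few lines. The sketch below uses Python's built-in sqlite3 module (assuming it was compiled with FTS5 support, as official builds are) and a minimal porter-tokenized table in place of the full base-table-plus-trigger setup; table names are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- minimal stand-ins for content_fts / content_vocab
    CREATE VIRTUAL TABLE content_fts USING fts5(keywords, tokenize='porter ascii');
    CREATE VIRTUAL TABLE content_vocab USING fts5vocab('content_fts', 'row');
    INSERT INTO content_fts(keywords) VALUES ('Sample'), ('Wallpaper');
""")

# The vocabulary exposes the stemmed, case-folded index terms,
# not the words that were inserted.
terms = [row[0] for row in db.execute("SELECT term FROM content_vocab ORDER BY term")]
print(terms)  # ['sampl', 'wallpap']
```

Note that both case folding ("Sample" to "sample") and stemming ("sample" to "sampl") happen before the term ever reaches the index.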
Stemming Algorithms and Vocabulary Table Misinterpretations
The root cause of truncated words lies in the interaction between stemming algorithms and FTS5’s tokenization pipeline. Stemmers like Porter or Snowball are linguistic normalization tools that reduce inflected or derived words to their root forms. For instance, "running" becomes "run," and "wallpapers" becomes "wallpaper." However, certain stemming rules aggressively truncate suffixes even when the root form isn’t a valid English word. The Porter stemmer, for example, transforms "sample" to "sampl" by removing the final "e" after simplifying the consonant-vowel-consonant sequence. Similarly, "wallpaper" becomes "wallpap" due to rules targeting specific suffix patterns.
Three factors exacerbate the problem:
- Tokenizer Configuration: Declaring `tokenize='porter ascii'` in the FTS5 table definition activates the stemmer. Developers unaware of the stemming behavior may not anticipate vocabulary truncation.
- Vocabulary Table Semantics: The fts5vocab table surfaces statistical data about the indexed terms, not the original input. This includes stemmed forms, case-folded terms, and other normalized tokens.
- Spellfix1 Integration: The Spellfix1 extension relies on a dedicated vocabulary table to suggest corrections. When populated with stemmed terms from fts5vocab, it prioritizes root forms over original spellings, leading to unintuitive suggestions.
A deeper issue is the expectation that fts5vocab serves as a direct reflection of the original dataset. In reality, FTS5 vocabulary tables expose the internal representations used by the search engine, which are optimized for query efficiency, not human readability. The stemmer’s transformations are irreversible, meaning the original terms cannot be reconstructed from the indexed data alone. Systems that depend on the original lexicon must preserve it separately from the FTS5 index.
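The irreversibility can be observed directly: several distinct surface forms collapse onto a single vocabulary term, so no post-processing of the index alone can recover which word produced it. A small sketch (Python stdlib sqlite3, FTS5 assumed available):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE VIRTUAL TABLE fts USING fts5(w, tokenize='porter ascii');
    CREATE VIRTUAL TABLE vocab USING fts5vocab('fts', 'row');
    -- three distinct surface forms, one stem
    INSERT INTO fts(w) VALUES ('sample'), ('samples'), ('sampling');
""")

# 'row' vocab tables report (term, doc, cnt); all three rows share one term
rows = db.execute("SELECT term, doc FROM vocab").fetchall()
print(rows)  # [('sampl', 3)]
```

Three inserted words, one vocabulary entry: the mapping from originals to stems is many-to-one, which is precisely why the original lexicon must be kept separately.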
Resolving Stemmer-Induced Truncation in FTS5 Search Indexes
To address truncated words in fts5vocab outputs while maintaining search functionality, developers can implement one of the following strategies:
1. Disable Stemming in FTS5 Tokenization
Remove the `porter` or `snowball` tokenizer argument when creating the FTS5 table. This forces the index to store terms verbatim, ensuring fts5vocab reflects original spellings:
```sql
CREATE VIRTUAL TABLE content_fts USING fts5(name, keywords, content='content', tokenize='ascii');
```
This approach eliminates stemming entirely, preserving exact matches. However, searches for "sample" will no longer match "samples" or "sampling," potentially reducing recall.
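Both sides of the trade-off are observable. In this sketch (Python stdlib sqlite3, FTS5 assumed available), the plain `ascii` tokenizer keeps the vocabulary intact, but a query for the plural form no longer finds the singular:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE VIRTUAL TABLE fts USING fts5(keywords, tokenize='ascii');
    CREATE VIRTUAL TABLE vocab USING fts5vocab('fts', 'row');
    INSERT INTO fts(keywords) VALUES ('sample wallpaper');
""")

# vocabulary now holds the words verbatim (case-folded only)
terms = [r[0] for r in db.execute("SELECT term FROM vocab ORDER BY term")]
print(terms)  # ['sample', 'wallpaper']

# without stemming, 'samples' no longer matches 'sample'
hits = db.execute("SELECT count(*) FROM fts WHERE fts MATCH 'samples'").fetchone()[0]
print(hits)  # 0
```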
2. Maintain a Separate Unstemmed Vocabulary
Create a parallel FTS5 table without stemming dedicated to vocabulary extraction. Combine this with the stemmed index for querying:
```sql
-- Original stemmed index for search
CREATE VIRTUAL TABLE content_fts USING fts5(name, keywords, content='content', tokenize='porter ascii');

-- Unstemmed index for vocabulary
CREATE VIRTUAL TABLE content_fts_raw USING fts5(name, keywords, content='content', tokenize='ascii');
CREATE VIRTUAL TABLE content_vocab_raw USING fts5vocab('content_fts_raw', 'row');

-- Populate Spellfix1 from the unstemmed vocab
INSERT INTO spellfix1_vocab(word) SELECT term FROM content_vocab_raw;
```
This decouples search optimization from vocabulary preservation, allowing both systems to operate optimally.
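A runnable sketch of the dual-index layout (Python stdlib sqlite3; external-content synchronization reduced to manual inserts for brevity, and table names illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE content(keywords TEXT);
    -- stemmed index used for searching
    CREATE VIRTUAL TABLE content_fts USING fts5(
        keywords, content='content', tokenize='porter ascii');
    -- unstemmed index used only as a vocabulary source
    CREATE VIRTUAL TABLE content_fts_raw USING fts5(
        keywords, content='content', tokenize='ascii');
    CREATE VIRTUAL TABLE content_vocab_raw USING fts5vocab('content_fts_raw', 'row');

    INSERT INTO content(keywords) VALUES ('sample wallpaper');
    -- keep both indexes in sync with the base table (a trigger would do this)
    INSERT INTO content_fts(rowid, keywords) SELECT rowid, keywords FROM content;
    INSERT INTO content_fts_raw(rowid, keywords) SELECT rowid, keywords FROM content;
""")

# full words are available for Spellfix1...
vocab = [r[0] for r in db.execute("SELECT term FROM content_vocab_raw ORDER BY term")]
print(vocab)  # ['sample', 'wallpaper']

# ...while the stemmed index still matches morphological variants
hits = db.execute(
    "SELECT count(*) FROM content_fts WHERE content_fts MATCH 'samples'"
).fetchone()[0]
print(hits)  # 1
```

The cost is a second full-text index over the same text, roughly doubling index storage for the affected columns.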
3. Augment the Base Table with Unstemmed Terms
Store the original terms alongside their stemmed counterparts in the base table. Use triggers to populate both versions:
```sql
ALTER TABLE content ADD COLUMN keywords_raw TEXT;

CREATE TRIGGER content_ai AFTER INSERT ON content
BEGIN
    INSERT INTO content_fts (rowid, name, keywords)
    VALUES (new.rowid, new.name, stem_function(new.keywords)); -- apply stemming programmatically
    UPDATE content SET keywords_raw = new.keywords WHERE rowid = new.rowid;
END;
```
Here, `stem_function` represents a custom stemming routine applied during insertion. The `keywords_raw` column preserves the original terms for use in Spellfix1 or other systems.
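In SQLite, `stem_function` would typically be registered as an application-defined SQL function before the trigger fires. The sketch below (Python stdlib sqlite3) registers a deliberately crude stand-in stemmer so the example stays self-contained; a real deployment would register a proper Porter implementation under the same name:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Toy stand-in for a real stemmer: strips one trailing 'e' or 's'.
# Registered under the name the trigger expects.
db.create_function("stem_function", 1,
                   lambda s: s[:-1] if s and s[-1] in "es" else s)

db.executescript("""
    CREATE TABLE content(name TEXT, keywords TEXT, keywords_raw TEXT);
    CREATE VIRTUAL TABLE content_fts USING fts5(name, keywords);
    CREATE TRIGGER content_ai AFTER INSERT ON content
    BEGIN
        INSERT INTO content_fts(rowid, name, keywords)
        VALUES (new.rowid, new.name, stem_function(new.keywords));
        UPDATE content SET keywords_raw = new.keywords WHERE rowid = new.rowid;
    END;
    INSERT INTO content(name, keywords) VALUES ('doc1', 'sample');
""")

row = db.execute("SELECT keywords, keywords_raw FROM content").fetchone()
print(row)      # ('sample', 'sample') -- original preserved in keywords_raw
stemmed = db.execute("SELECT keywords FROM content_fts").fetchone()[0]
print(stemmed)  # 'sampl' -- stemmed form stored in the search index
```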
4. Custom Tokenizer with Stemming Bypass
Develop a custom FTS5 tokenizer that emits both stemmed and unstemmed tokens. This requires modifying the SQLite build to include a tokenizer that duplicates terms:
```c
/* Sketch of an fts5_tokenizer xTokenize implementation that emits each
** token twice: once verbatim and once stemmed. The token callback is
** passed in as a parameter, per the FTS5 custom-tokenizer API.
** porter_stem() is a hypothetical helper returning a NUL-terminated
** copy allocated with sqlite3_malloc(). */
static int xTokenize(
  Fts5Tokenizer *pTokenizer,        /* Tokenizer instance */
  void *pCtx,                       /* Opaque context for xToken */
  int flags,                        /* FTS5_TOKENIZE_* flags */
  const char *pText, int nText,     /* Text to tokenize */
  int (*xToken)(void*, int, const char*, int, int, int)
){
  /* A real tokenizer would loop over the words in pText; a single
  ** token spanning the whole input stands in for that here. */
  int rc = xToken(pCtx, 0, pText, nText, 0, nText);   /* original term */
  if( rc==SQLITE_OK ){
    char *zStem = porter_stem(pText, nText);
    /* FTS5_TOKEN_COLOCATED places the stem at the same position as
    ** the original, so queries can match either form. */
    rc = xToken(pCtx, FTS5_TOKEN_COLOCATED, zStem, (int)strlen(zStem), 0, nText);
    sqlite3_free(zStem);
  }
  return rc;
}
```
Register this tokenizer as `dual_porter` and use it in the FTS5 table:
```sql
CREATE VIRTUAL TABLE content_fts USING fts5(
    name, keywords,
    content='content',
    tokenize='dual_porter ascii'
);
```
The fts5vocab table will now contain both original and stemmed terms, allowing Spellfix1 to reference the unmodified versions.
5. Post-Process Vocabulary Table Outputs
If modifying the database schema is impractical, apply a reverse-stemming mapping to the fts5vocab terms before feeding them to Spellfix1. This requires maintaining a lookup table that maps stems to their most frequent original forms:
```sql
-- Example mapping table
CREATE TABLE stem_mapping (
    stem TEXT PRIMARY KEY,
    original TEXT
);

-- Populate by analyzing the base table
-- (porter_stem() is an application-defined SQL function, not a built-in;
--  OR IGNORE keeps the first original seen when several share a stem)
INSERT OR IGNORE INTO stem_mapping
SELECT DISTINCT
    porter_stem(keywords) AS stem,
    keywords AS original
FROM content;

-- Query Spellfix1 with mapped terms
SELECT word FROM spellfix1_vocab
WHERE word IN (SELECT original FROM stem_mapping WHERE stem = ?);
```
This method approximates the original terms but depends on the accuracy of the mapping table.
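If no SQL-level `porter_stem()` function is available, FTS5 itself can compute the stems for the mapping: feed each original term through a scratch porter-tokenized table and read the stem back from its vocabulary. A sketch (Python stdlib sqlite3; assumes each keywords value is a single word):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE content(keywords TEXT);
    INSERT INTO content(keywords) VALUES ('sample'), ('wallpaper');

    CREATE TABLE stem_mapping(stem TEXT PRIMARY KEY, original TEXT);
    -- scratch index used only to run the porter stemmer
    CREATE VIRTUAL TABLE scratch USING fts5(w, tokenize='porter ascii');
    CREATE VIRTUAL TABLE scratch_vocab USING fts5vocab('scratch', 'row');
""")

for (word,) in db.execute("SELECT DISTINCT keywords FROM content").fetchall():
    db.execute("DELETE FROM scratch")
    db.execute("INSERT INTO scratch(w) VALUES (?)", (word,))
    stem = db.execute("SELECT term FROM scratch_vocab").fetchone()[0]
    # OR IGNORE keeps the first original seen for each stem
    db.execute("INSERT OR IGNORE INTO stem_mapping VALUES (?, ?)", (stem, word))

mapping = db.execute("SELECT * FROM stem_mapping ORDER BY stem").fetchall()
print(mapping)  # [('sampl', 'sample'), ('wallpap', 'wallpaper')]
```

This guarantees the mapping uses exactly the same stemmer as the search index, at the cost of one insert-and-read round trip per distinct term.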
6. Hybrid Search-Vocabulary Schema
Combine the stemmed FTS5 index with a separate Spellfix1 table populated directly from the base table:
```sql
-- Split each comma-separated keywords value into individual terms
-- and load the original spellings into the Spellfix1 vocabulary
INSERT INTO spellfix1_vocab(word)
SELECT DISTINCT trim(j.value)
FROM content,
     json_each('["' || replace(keywords, ',', '","') || '"]') AS j;
```
This bypasses fts5vocab entirely, ensuring Spellfix1 references the original terms.
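A working version of the splitting step (Python stdlib sqlite3; relies on SQLite's JSON functions, built in since 3.38, and assumes keyword values contain no quote characters; a plain table stands in for the Spellfix1 virtual table, which requires the loadable extension):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE content(keywords TEXT);
    INSERT INTO content(keywords) VALUES ('sample,wallpaper'), ('wallpaper, poster');
    -- stand-in for the spellfix1 virtual table
    CREATE TABLE spellfix1_vocab(word TEXT);

    -- split each comma-separated list, keeping the original spellings
    INSERT INTO spellfix1_vocab(word)
    SELECT DISTINCT trim(j.value)
    FROM content,
         json_each('["' || replace(keywords, ',', '","') || '"]') AS j;
""")

words = sorted(r[0] for r in db.execute("SELECT word FROM spellfix1_vocab"))
print(words)  # ['poster', 'sample', 'wallpaper']
```

The correlated `json_each` call runs once per `content` row, turning each comma list into a set of rows, so duplicates across rows collapse under `DISTINCT`.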
By implementing one of these solutions, developers can reconcile the conflict between search optimization (via stemming) and accurate spell-checking or vocabulary reporting. The optimal approach depends on the specific requirements for search recall, storage overhead, and system complexity.