FTS5 Porter Stemmer and Unicode61 Tokenizer Conflicts in Non-English Text Searches

Issue Overview: Inconsistent Search Results with FTS5 Porter-Unicode61 Tokenizer in Polish Text

A user reported unexpected behavior when querying Polish-language lyrics stored in an SQLite FTS5 virtual table configured with the tokenize="porter unicode61 remove_diacritics 2" option. Searches for terms like piecy* (intended to match words starting with "piecy") returned inconsistent results. For example, piecy* matched "pięciu" (the Polish word for "five") but not "Piecyka" (a proper noun), while piecyk* returned "Piecyka" but not "pięciu." The database schema uses a combination of the Porter stemmer and Unicode61 tokenizer with diacritic removal. The core issue revolves around the interplay between these tokenization components and their compatibility with Polish-language morphology.

The problem arises from three intersecting factors:

Language-Specific Stemming: The Porter stemming algorithm is designed for English, not Polish. It incorrectly reduces Polish words to stems, leading to mismatches between query terms and indexed tokens.
Diacritic Removal Conflicts: The remove_diacritics 2 option strips accent marks from characters, altering Polish words in ways that conflict with their grammatical rules (e.g., "pięciu" becomes "pieciu").
Tokenizer Order and Configuration: The syntax porter unicode61 suggests an attempt to combine tokenizers, but FTS5 interprets this as a single tokenizer name, defaulting to unintended behavior.

These factors create a mismatch between the indexed tokens and the query logic, causing incomplete or irrelevant search results.

Root Causes: Porter Stemmer Limitations and Tokenizer Misconfiguration

1. Porter Stemmer’s Incompatibility with Polish Morphology

The Porter stemmer is optimized for English verb conjugations and noun pluralization. Polish morphology involves complex inflectional patterns (e.g., seven grammatical cases, gender agreements) that the Porter algorithm cannot handle. For example:

"pięciu" (genitive plural of "pięć") is stemmed to "piec" (the Polish word for "stove" or "furnace"), not "pięciu."
"Piecyka" (a diminutive form) is reduced to "piecyk," but the Porter stemmer may further truncate it to "piec," creating false positives.

This results in overstemming (aggressive truncation that merges unrelated words) or understemming (failure to conflate grammatically related words).

2. Unicode61 Tokenizer and Diacritic Removal

The Unicode61 tokenizer splits text at Unicode whitespace and punctuation, but the remove_diacritics 2 setting complicates Polish-language processing:

Diacritics are integral to Polish spelling and meaning. Removing them turns "pięciu" into "pieciu," which is not a valid Polish word.
Queries for "piecy*" (without diacritics) fail to match "Piecyka" (with diacritics) because the indexed token becomes "piecyk" after diacritic removal, while the query assumes an exact prefix match.

3. Tokenizer Syntax Misinterpretation

The tokenize="porter unicode61 remove_diacritics 2" parameter is syntactically invalid. FTS5 expects a single tokenizer name followed by optional arguments. Here, SQLite interprets the entire string as a tokenizer named "porter," with "unicode61," "remove_diacritics," and "2" treated as arguments—which the Porter tokenizer ignores. Consequently, the table defaults to the Porter tokenizer (which uses the simple tokenizer internally) instead of the intended Unicode61 rules. This explains why "piecy*" matches "pięciu": the simple tokenizer splits words at non-alphanumeric characters, and the Porter stemmer aggressively truncates both terms to "piec."

Resolution: Reconfiguring FTS5 for Polish-Language Tokenization

Step 1: Validate Tokenizer Configuration

First, correct the tokenizer syntax. The porter and unicode61 tokenizers cannot be combined directly. To use Unicode61 with diacritic removal, specify:

CREATE VIRTUAL TABLE teksty USING FTS5(
  id UNINDEXED, 
  tytul, 
  alt_tytul,
  rok,
  tekst,
  muzyka,
  slowa,
  pochodzenie,
  scena,
  gatunek, 
  tokenize="unicode61 remove_diacritics 2"
);

This ensures proper Unicode61 tokenization without Porter stemming.

Step 2: Disable Porter Stemming for Polish Text

Remove the Porter stemmer entirely. For Polish, stemming requires language-specific rules unavailable in SQLite’s built-in tokenizers. Instead, rely on prefix searches (*) and phrase matching. For example:

-- Search for prefixes without stemming
SELECT * FROM teksty WHERE tekst MATCH 'piecyk*';

Step 3: Handle Diacritics Explicitly

If diacritic removal is necessary, preprocess both the indexed text and queries to normalize characters. Use SQLite’s REPLACE() function or application-layer logic to map accented characters to their ASCII equivalents (e.g., "ę" → "e"). However, this risks conflating distinct words (e.g., "łaska" vs. "laska").

Step 4: Implement Custom Tokenization (Advanced)

For advanced stemming, create a custom FTS5 tokenizer using SQLite’s C API. Integrate a Polish stemmer like Snowball or Morfeusz. Example steps:

Compile SQLite with a loadable extension that implements the Polish stemmer.
Define a custom tokenizer that combines Unicode61 segmentation and Polish stemming:

CREATE VIRTUAL TABLE teksty USING FTS5(
  ...,
  tokenize="polish_unicode61"
);

Step 5: Verify Tokenization with `fts5vocab`

Use the fts5vocab virtual table to inspect indexed tokens:

-- List tokens in the 'tekst' column
SELECT * FROM fts5vocab('teksty', 'row') WHERE term MATCH 'piec*';

This reveals whether "pięciu" and "Piecyka" are tokenized as expected.

Step 6: Optimize Queries for Unicode61

With diacritics removed, adjust queries to account for normalized spelling:

-- Match "Piecyka" (normalized to "piecyk")
SELECT * FROM teksty WHERE tekst MATCH 'piecyk*';

Step 7: Consider Alternative Tokenizer Libraries

Explore third-party tokenizers like SQLite FTS5 Stemmers, which support multiple languages. These require compiling SQLite with additional dependencies but provide more robust stemming for non-English text.

By addressing the Porter stemmer’s incompatibility, reconfiguring the tokenizer, and validating indexed tokens, users can achieve consistent search results for Polish-language content in FTS5. For languages with complex morphology, custom tokenizers or application-layer preprocessing are often necessary to supplement SQLite’s built-in capabilities.

FTS5 Porter Stemmer and Unicode61 Tokenizer Conflicts in Non-English Text Searches

Issue Overview: Inconsistent Search Results with FTS5 Porter-Unicode61 Tokenizer in Polish Text

Root Causes: Porter Stemmer Limitations and Tokenizer Misconfiguration

1. Porter Stemmer’s Incompatibility with Polish Morphology

2. Unicode61 Tokenizer and Diacritic Removal

3. Tokenizer Syntax Misinterpretation

Resolution: Reconfiguring FTS5 for Polish-Language Tokenization

Step 1: Validate Tokenizer Configuration

Step 2: Disable Porter Stemming for Polish Text

Step 3: Handle Diacritics Explicitly

Step 4: Implement Custom Tokenization (Advanced)

Step 5: Verify Tokenization with `fts5vocab`

Step 6: Optimize Queries for Unicode61

Step 7: Consider Alternative Tokenizer Libraries

Handling Non-ISO Date Formats in SQLite with Custom Datetime Modifiers

FTS5 Greek Diacritic Insensitivity in Full-Text Search Queries

Handling SQLite Aggregate Functions in Tcl When Tables Are Empty

DATETIME Division by Float in SQLite: Troubleshooting and Solutions

SQLite Left Outer Join on Pragma Tables Behaves as Inner Join

group_concat() NULL Handling with rowid in SQLite Left Joins

Leave a Reply Cancel reply

Issue Overview: Inconsistent Search Results with FTS5 Porter-Unicode61 Tokenizer in Polish Text

Root Causes: Porter Stemmer Limitations and Tokenizer Misconfiguration

1. Porter Stemmer’s Incompatibility with Polish Morphology

2. Unicode61 Tokenizer and Diacritic Removal

3. Tokenizer Syntax Misinterpretation

Resolution: Reconfiguring FTS5 for Polish-Language Tokenization

Step 1: Validate Tokenizer Configuration

Step 2: Disable Porter Stemming for Polish Text

Step 3: Handle Diacritics Explicitly

Step 4: Implement Custom Tokenization (Advanced)

Step 5: Verify Tokenization with fts5vocab

Step 6: Optimize Queries for Unicode61

Step 7: Consider Alternative Tokenizer Libraries

Related Guides

Leave a Reply Cancel reply

Step 5: Verify Tokenization with `fts5vocab`