Resolving FTS5 Tokenizer Constructor Errors When Combining Tokenizers

Issue Overview: Understanding FTS5 Tokenizer Limitations and Substring Query Requirements

The core issue revolves around attempting to create a SQLite FTS5 virtual table that names two tokenizers (unicode61 and trigram) in a single table definition, which fails with the message "error in tokenizer constructor". The error arises because FTS5 accepts only one tokenizer per table; the sole exception is the porter tokenizer, which acts as a wrapper around another tokenizer. The goal is to enable both full-text search (using standard tokenization) and substring pattern matching (similar to LIKE or GLOB queries) within the same table. These requirements conflict because full-text search relies on tokenization to break text into discrete units (words, phrases), while substring matching requires preserving or generating overlapping character sequences (e.g., trigrams).
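
The failing statement can be reproduced with a one-line table definition (the table and column names here are illustrative):

-- Naming two tokenizers in one definition fails
CREATE VIRTUAL TABLE docs USING fts5(body, tokenize = 'unicode61 trigram');
-- => Error: error in tokenizer constructor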

FTS5 tokenizers are responsible for splitting input text into tokens that are indexed for efficient querying. The unicode61 tokenizer is the default option: it performs Unicode-aware tokenization, separating text at whitespace and punctuation and optionally removing diacritics. The trigram tokenizer generates overlapping three-character sequences to facilitate substring matching, but it is only built into FTS5 from SQLite 3.34.0 onward; on older builds the name is simply unrecognized. Even where both tokenizers are available, naming one alongside the other violates the single-tokenizer constraint. This limitation forces developers to seek alternative strategies for combining full-text and substring search capabilities.

The error message points to a failure inside the tokenizer constructor: the FTS5 engine cannot make sense of the combined tokenizer specification. The underlying challenge is that full-text search and substring matching require fundamentally different indexing approaches. Full-text search prioritizes lexical units, while substring matching depends on character-level granularity. Bridging these requirements within a single FTS5 table is not feasible using standard tokenizers, necessitating architectural workarounds.

Possible Causes: Misconfigured Tokenizer Syntax and Architectural Misalignments

  1. Invalid Tokenizer Combination: FTS5 allows only one primary tokenizer per table. The syntax tokenize = 'unicode61 trigram' does not select two tokenizers: FTS5 parses the first word as the tokenizer name and every following word as an argument to that tokenizer's constructor, so trigram is handed to unicode61, which rejects it. The porter tokenizer is the sole exception, as it post-processes another tokenizer's output (e.g., tokenize = 'porter unicode61'). See the examples after this list.

  2. Misunderstanding Tokenizer Parameters: The unicode61 tokenizer accepts optional parameters such as remove_diacritics and separators, but these are written as space-separated key-value pairs inside the quoted tokenize string (e.g., tokenize = 'unicode61 remove_diacritics 0'), not in parentheses. Because everything after the tokenizer name is parsed as a parameter, a stray word like trigram is interpreted as an invalid argument rather than a second tokenizer.

  3. Unrecognized Tokenizer Name: While unicode61 is always built in, trigram only became part of the standard FTS5 distribution in SQLite 3.34.0. On older versions, a trigram tokenizer must be supplied as a custom tokenizer and registered through the fts5_api's xCreateTokenizer method before being referenced in a CREATE VIRTUAL TABLE statement. Referencing an unregistered tokenizer results in the constructor error.

  4. Architectural Conflict Between Search Efficiencies: Full-text search and substring queries optimize for different data access patterns. FTS5's inverted index structure accelerates term lookups but does not inherently support arbitrary substring matching. Forcing both requirements into a single table creates a design mismatch that the tokenizer constructor cannot resolve.
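
The distinction in cause 1 is easiest to see side by side (t_ok and t_bad are illustrative names):

-- Valid: porter post-processes the output of unicode61
CREATE VIRTUAL TABLE t_ok USING fts5(x, tokenize = 'porter unicode61');

-- Invalid: 'trigram' is parsed as an argument to unicode61, whose
-- constructor rejects it and reports "error in tokenizer constructor"
CREATE VIRTUAL TABLE t_bad USING fts5(x, tokenize = 'unicode61 trigram');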

Troubleshooting Steps, Solutions & Fixes: External Content Tables, Vocabularies, and Hybrid Approaches

Step 1: Validate Tokenizer Syntax and Registration

Ensure that the tokenizer name and parameters adhere to FTS5 syntax rules. For built-in tokenizers like unicode61, parameters are passed as space-separated tokens inside the quoted string, not in parentheses:

-- Correct syntax for unicode61 with parameters
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'unicode61 remove_diacritics 0');

If using a custom trigram tokenizer, confirm that it is registered before table creation through the fts5_api object's xCreateTokenizer method (the fts5_api pointer is obtained by evaluating SELECT fts5(?1) with sqlite3_bind_pointer, as shown in Step 5). Custom tokenizers implement the xCreate, xTokenize, and xDelete callbacks of the fts5_tokenizer struct.
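
Before writing any C, it is worth probing whether the running SQLite build already knows the tokenizer; this throwaway table (probe is an illustrative name) fails with the same constructor error on builds older than 3.34.0:

-- Succeeds on SQLite 3.34.0+, where trigram is built in
CREATE VIRTUAL TABLE probe USING fts5(x, tokenize = 'trigram');
DROP TABLE probe;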

Step 2: Employ External Content Tables for Multiple Tokenizers

Store the original text once in an ordinary table, then create separate FTS5 external-content tables with different tokenizers that reference it. This avoids duplicating the source data while enabling both search modes:

-- Base table holding the original content
CREATE TABLE main_content (id INTEGER PRIMARY KEY, text TEXT);

-- FTS5 table for full-text search with unicode61
CREATE VIRTUAL TABLE ft_unicode USING fts5(text, content = 'main_content', content_rowid = 'id', tokenize = 'unicode61');

-- FTS5 table for trigram-based substring search
CREATE VIRTUAL TABLE ft_trigram USING fts5(text, content = 'main_content', content_rowid = 'id', tokenize = 'trigram');

Queries against ft_unicode leverage standard full-text search, while ft_trigram supports substring patterns (the built-in trigram tokenizer requires SQLite 3.34.0 or later). Note that external content tables are not synchronized automatically: triggers on main_content must forward inserts, updates, and deletes to both FTS5 tables, as shown below.
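
A minimal set of synchronization triggers, following the pattern documented for FTS5 external content tables (trigger names are illustrative):

CREATE TRIGGER main_content_ai AFTER INSERT ON main_content BEGIN
  INSERT INTO ft_unicode(rowid, text) VALUES (new.id, new.text);
  INSERT INTO ft_trigram(rowid, text) VALUES (new.id, new.text);
END;

CREATE TRIGGER main_content_ad AFTER DELETE ON main_content BEGIN
  -- The special 'delete' command removes the old row from each index
  INSERT INTO ft_unicode(ft_unicode, rowid, text) VALUES ('delete', old.id, old.text);
  INSERT INTO ft_trigram(ft_trigram, rowid, text) VALUES ('delete', old.id, old.text);
END;

CREATE TRIGGER main_content_au AFTER UPDATE ON main_content BEGIN
  INSERT INTO ft_unicode(ft_unicode, rowid, text) VALUES ('delete', old.id, old.text);
  INSERT INTO ft_unicode(rowid, text) VALUES (new.id, new.text);
  INSERT INTO ft_trigram(ft_trigram, rowid, text) VALUES ('delete', old.id, old.text);
  INSERT INTO ft_trigram(rowid, text) VALUES (new.id, new.text);
END;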

Step 3: Leverage FTS5 Vocabulary Tables for Substring Matching

If maintaining multiple FTS5 tables is impractical, use the fts5vocab virtual table to access tokenized terms and implement substring filtering:

-- Assuming 'ft_unicode' is the primary FTS5 table
CREATE VIRTUAL TABLE ft_vocab USING fts5vocab(ft_unicode, 'instance');

-- Query for documents containing tokens that include 'substr'
-- (the 'instance' vocab type exposes the document rowid as 'doc')
SELECT DISTINCT doc FROM ft_vocab WHERE term LIKE '%substr%';

This approach retrieves documents whose tokens contain the substring, but it cannot match substrings that span token boundaries. For broader matches, combine it with application-side processing or SQLite's LIKE operator on the original content, as sketched below.
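
A sketch of that combination, reusing the tables from Step 2: the vocabulary (typically far smaller than the document set) acts as the filter, and the join pulls back the matching rows:

-- Token-level candidates from the vocabulary, joined back to the content
SELECT m.id, m.text
FROM main_content AS m
WHERE m.id IN (SELECT doc FROM ft_vocab WHERE term LIKE '%substr%');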

Step 4: Implement Hybrid Search with Views

Create a view that filters the joined content on a substring condition (SQLite has no materialized views, so the view is computed at query time):

-- View joining FTS5 rowids with the original content
CREATE VIEW hybrid_search AS
SELECT f.rowid AS docid, m.text
FROM ft_unicode AS f
JOIN main_content AS m ON f.rowid = m.id
WHERE m.text LIKE '%substr%';

This restricts results to rows that also contain the substring. Be aware that a pattern with a leading wildcard ('%substr%') can never use an index: SQLite's LIKE optimization applies only to patterns with a literal prefix such as 'substr%', so the view performs a scan over the joined rows.
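
In practice, a parameterized query is often more flexible than a view with a hard-coded pattern; this sketch combines a full-text MATCH with a substring test (the query strings are placeholders):

-- Full-text matches narrowed by a substring test on the source text
SELECT main_content.id, main_content.text
FROM ft_unicode
JOIN main_content ON ft_unicode.rowid = main_content.id
WHERE ft_unicode MATCH 'search terms'
  AND main_content.text LIKE '%substr%';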

Step 5: Develop a Custom Trigram Tokenizer

For advanced use cases, implement a custom FTS5 tokenizer that emits trigrams alongside standard tokens. This requires C programming and SQLite API integration:

#include <sqlite3ext.h>
SQLITE_EXTENSION_INIT1

/* Obtain the fts5_api pointer for this connection (the documented idiom). */
static fts5_api *fts5_api_from_db(sqlite3 *db){
  fts5_api *pApi = 0;
  sqlite3_stmt *pStmt = 0;
  if( sqlite3_prepare_v2(db, "SELECT fts5(?1)", -1, &pStmt, 0)==SQLITE_OK ){
    sqlite3_bind_pointer(pStmt, 1, (void*)&pApi, "fts5_api_ptr", 0);
    sqlite3_step(pStmt);
  }
  sqlite3_finalize(pStmt);
  return pApi;
}

/* fts5_tokenizer callbacks: xTokenize should call xToken() once per
** word-style token and once per overlapping trigram. Bodies elided. */
static int hybridCreate(void *pCtx, const char **azArg, int nArg,
                        Fts5Tokenizer **ppOut){ /* ... */ return SQLITE_OK; }
static void hybridDelete(Fts5Tokenizer *p){ /* ... */ }
static int hybridTokenize(Fts5Tokenizer *p, void *pCtx, int flags,
                          const char *pText, int nText,
                          int (*xToken)(void*, int, const char*, int, int, int)){
  /* ... emit unicode61-style tokens and trigrams via xToken() ... */
  return SQLITE_OK;
}

/* Entry point: register the tokenizer as "trigram_hybrid". */
int sqlite3_extension_init(sqlite3 *db, char **pzErrMsg,
                           const sqlite3_api_routines *pApi){
  SQLITE_EXTENSION_INIT2(pApi);
  fts5_api *pFts5 = fts5_api_from_db(db);
  if( pFts5==0 ) return SQLITE_ERROR;
  static fts5_tokenizer tok = { hybridCreate, hybridDelete, hybridTokenize };
  return pFts5->xCreateTokenizer(pFts5, "trigram_hybrid", 0, &tok, 0);
}

Compile the extension and load it into SQLite, then reference it in the FTS5 table definition:

CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'trigram_hybrid');

This custom tokenizer can produce tokens suitable for both full-text and substring queries, though it increases complexity and may impact indexing performance.

Step 6: Optimize Substring Queries with Shadow Tables

If substring matching is performance-critical, maintain a separate shadow table storing precomputed trigrams:

CREATE TABLE text_trigrams (
  docid INTEGER,
  trigram TEXT,
  FOREIGN KEY (docid) REFERENCES main_content(id)
);

CREATE INDEX idx_trigram ON text_trigrams(trigram);

-- Populate trigrams via application logic or triggers
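
For bulk population without leaving SQL, a recursive CTE can enumerate trigram positions; this is a sketch assuming documents fit comfortably in memory (an application-side loop is equally valid):

-- Decompose every document of length >= 3 into its trigrams
WITH RECURSIVE pos(docid, body, n) AS (
  SELECT id, text, 1 FROM main_content WHERE length(text) >= 3
  UNION ALL
  SELECT docid, body, n + 1 FROM pos WHERE n + 2 < length(body)
)
INSERT INTO text_trigrams (docid, trigram)
SELECT DISTINCT docid, substr(body, n, 3) FROM pos;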

Querying for a substring involves decomposing the target string into its trigrams and requiring that a document contain all of them (trigram co-occurrence does not guarantee adjacency, so candidates should still be verified with LIKE):

SELECT docid FROM text_trigrams
WHERE trigram IN ('sub', 'ubs', 'bst', 'str')  -- the four trigrams of 'substr'
GROUP BY docid
HAVING COUNT(DISTINCT trigram) = 4;

This approach allows efficient substring lookups at the cost of increased storage and maintenance overhead.

Step 7: Evaluate Full-Text Search Extensions

First check whether an upgrade suffices: SQLite 3.34.0 and later ship FTS5 with a built-in trigram tokenizer, which can also serve LIKE and GLOB predicates against the indexed column. Where upgrading is not an option, third-party SQLite extensions like sqlean or sqlite3-fts5-trigram provide prebundled tokenizers and functions optimized for substring matching, reducing implementation effort.
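
If the upgrade path is open, the built-in tokenizer collapses most of this guide into a few lines; a minimal sketch (ft_tri is an illustrative name):

-- Requires SQLite 3.34.0 or later
CREATE VIRTUAL TABLE ft_tri USING fts5(text, tokenize = 'trigram');

-- MATCH treats any string of three or more characters as a substring
-- query, and LIKE/GLOB on the column can use the trigram index
SELECT rowid FROM ft_tri WHERE ft_tri MATCH 'substr';
SELECT rowid FROM ft_tri WHERE text LIKE '%substr%';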

Step 8: Adjust Application Logic to Handle Multiple Query Types

Design application-layer logic to route queries to the appropriate FTS5 table or hybrid strategy based on the presence of wildcards or specific operators. For example:

  • Use MATCH for full-text queries against ft_unicode.
  • Use LIKE/GLOB against main_content for substring patterns.
  • Combine both using INTERSECT for hybrid requirements, as sketched below.
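
The INTERSECT route can be sketched against the Step 2 schema (the query strings are placeholders):

-- Documents satisfying both the full-text query and the substring pattern
SELECT rowid FROM ft_unicode WHERE ft_unicode MATCH 'search terms'
INTERSECT
SELECT id FROM main_content WHERE text LIKE '%substr%';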

Step 9: Benchmark and Optimize Indexing Performance

Assess the trade-offs between indexing speed, query latency, and storage overhead for each approach. External content tables with multiple tokenizers double the indexing workload but maintain query efficiency. Shadow tables and vocab-based queries shift computational burden to query time. Use SQLite’s EXPLAIN QUERY PLAN and profiling tools to identify bottlenecks.
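
For example, prefixing a candidate query with EXPLAIN QUERY PLAN shows whether the FTS5 index or a full scan is being used:

EXPLAIN QUERY PLAN
SELECT rowid FROM ft_unicode WHERE ft_unicode MATCH 'search terms';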

Step 10: Document Architectural Decisions and Constraints

Maintain clear documentation outlining the chosen strategy, its limitations, and maintenance procedures. For example, if using external content tables, document the trigger logic for data synchronization or the process for rebuilding indexes after schema changes.

By systematically addressing tokenizer configuration errors, decoupling search requirements into distinct tables, and leveraging SQLite’s extensibility, developers can achieve both full-text and substring search capabilities without triggering constructor errors. The optimal solution depends on the specific balance of query performance, storage efficiency, and implementation complexity required by the application.
