Implementing Trigram-Like Search in SQLite Using FTS5 and Custom Tokenizers
Substring Search Challenges in SQLite Without Native Trigram Support
SQLite, while being a lightweight and powerful database engine, lacks native support for trigram indexes, a feature commonly found in databases like PostgreSQL. Trigram indexes are particularly useful for speeding up substring searches, such as those performed using the LIKE operator with wildcards on both ends (e.g., LIKE '%smil%'). These searches are inherently slow in SQLite because they require a full table scan, as no index can be used to optimize them. This limitation becomes particularly problematic when dealing with large datasets where performance is critical.
The core issue is that SQLite's Full-Text Search (FTS) modules, such as FTS3 and FTS5, are designed to handle prefix or exact token matches rather than arbitrary substring searches. For example, FTS5 can efficiently handle queries like column MATCH 'cot*' but struggles with column LIKE '%cot%'. This discrepancy arises because FTS5 tokenizes text into words or phrases and indexes them, but it does not inherently support indexing overlapping substrings or trigrams (three-character sequences) within words.
Custom Tokenizers and FTS5 as a Workaround for Trigram-Like Functionality
One possible solution to this limitation is to leverage SQLite’s extensibility by creating a custom tokenizer for FTS5 that emulates trigram behavior. A custom tokenizer can break text into overlapping three-character sequences, effectively creating a trigram index within the FTS5 framework. This approach allows SQLite to perform substring searches more efficiently by leveraging the FTS5 index.
The custom tokenizer works by scanning the input text and generating tokens for every possible three-character sequence. For example, the word "simon" would be tokenized into "sim", "imo", and "mon". These tokens are then indexed by FTS5, enabling fast lookups for any substring that matches one of these trigrams. While this method does not replicate PostgreSQL’s trigram indexes exactly, it provides a similar level of functionality and performance improvement for substring searches.
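The sliding-window behavior of the tokenizer can be sketched in a few lines (shown in Python here for brevity; the actual tokenizer, given below, is C):

```python
def trigrams(text):
    """Return every overlapping three-character sequence in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

print(trigrams("simon"))  # ['sim', 'imo', 'mon']
```

Any input shorter than three characters yields no trigrams at all, which foreshadows one of the limitations discussed next.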
However, this approach has some limitations. First, it requires additional storage space for the trigram tokens and their index. Second, the custom tokenizer must be implemented in C and compiled into an SQLite extension, which adds complexity to the deployment process. Finally, because only fixed three-character sequences are indexed, queries shorter than three characters cannot use the index at all, and substrings longer than three characters must be expressed as a phrase of consecutive trigrams rather than a single token lookup.
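The variable-length limitation is workable in practice: because consecutive trigrams are assigned consecutive token positions, a substring longer than three characters can be matched as an FTS5 phrase of its trigrams. A sketch of that query rewriting (the quoting follows FTS5 phrase syntax; the function name is illustrative):

```python
def substring_to_fts_query(substring):
    """Rewrite a substring (>= 3 chars) as an FTS5 phrase of its
    consecutive trigrams, e.g. 'simo' -> '"sim" + "imo"'."""
    if len(substring) < 3:
        raise ValueError("a trigram index cannot serve queries shorter than 3 chars")
    parts = [substring[i:i + 3] for i in range(len(substring) - 2)]
    return " + ".join('"%s"' % p for p in parts)

print(substring_to_fts_query("simo"))  # "sim" + "imo"
```

The resulting string can be passed wherever the examples below use a bare trigram, e.g. SELECT * FROM dict('"sim" + "imo"').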
Building and Using a Custom Trigram Tokenizer for FTS5 in SQLite
To implement a trigram-like search in SQLite, follow these steps:
Step 1: Compile the Custom Tokenizer Extension
The first step is to compile the custom tokenizer into an SQLite extension. The tokenizer code, written in C, must be compiled into a shared library (e.g., ftstri.so on Linux or ftstri.dylib on macOS). Below is the C code for the custom tokenizer:
#include "sqlite3ext.h"
SQLITE_EXTENSION_INIT1
#include <assert.h>
#include <string.h>
#include "fts5.h"

/* Retrieve the fts5_api pointer from the database handle via the
** "fts5_api_ptr" pointer-passing interface (requires SQLite 3.20+). */
static fts5_api *fts5_api_from_db(sqlite3 *db) {
  fts5_api *pRet = 0;
  sqlite3_stmt *pStmt = 0;
  if (SQLITE_OK == sqlite3_prepare(db, "SELECT fts5(?1)", -1, &pStmt, 0)) {
    sqlite3_bind_pointer(pStmt, 1, (void*)&pRet, "fts5_api_ptr", NULL);
    sqlite3_step(pStmt);
  }
  sqlite3_finalize(pStmt);
  return pRet;
}

/* The tokenizer keeps no per-instance state, so a single static dummy
** object is handed out as the "instance". */
static int dummy = 0;

static int ftsTriCreate(
  void *pCtx,
  const char **azArg,
  int nArg,
  Fts5Tokenizer **ppOut
){
  (void)pCtx; (void)azArg; (void)nArg;
  *ppOut = (Fts5Tokenizer*)&dummy;
  return SQLITE_OK;
}

static void ftsTriDelete(Fts5Tokenizer *p) {
  assert(p == (Fts5Tokenizer*)&dummy);
}

/* Emit one token for every overlapping three-byte sequence in the input.
** Note this operates on bytes, so multi-byte UTF-8 characters are split
** across trigrams; that is consistent as long as queries are tokenized
** the same way. Inputs shorter than three bytes produce no tokens. */
static int ftsTriTokenize(
  Fts5Tokenizer *pUnused,
  void *pCtx,
  int flags,
  const char *pText, int nText,
  int (*xToken)(void*, int, const char*, int, int, int)
){
  int i;
  (void)pUnused; (void)flags;
  for (i = 0; i < nText - 2; i++) {
    int rc = xToken(pCtx, 0, &pText[i], 3, i, i + 3);
    if (rc) return rc;
  }
  return SQLITE_OK;
}

static int ftsTriInstall(sqlite3 *db) {
  fts5_api *pApi;
  fts5_tokenizer tok;
  tok.xCreate = ftsTriCreate;
  tok.xDelete = ftsTriDelete;
  tok.xTokenize = ftsTriTokenize;
  pApi = fts5_api_from_db(db);
  if (pApi == 0) return SQLITE_ERROR;  /* FTS5 is not available in this build */
  return pApi->xCreateTokenizer(pApi, "tri", 0, &tok, 0);
}

#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_ftstri_init(
  sqlite3 *db,
  char **pzErrMsg,
  const sqlite3_api_routines *pApi
){
  (void)pzErrMsg;
  SQLITE_EXTENSION_INIT2(pApi);
  return ftsTriInstall(db);
}
To compile the code, use the following command. The sqlite3ext.h and fts5.h headers from the SQLite source distribution must be in the include path (here, the current directory):
gcc -I. -g -fPIC -shared ftstri.c -o ftstri.so
Step 2: Load the Extension and Create a Virtual Table
Once the extension is compiled, load it into SQLite and create a virtual table using the custom tokenizer:
.load ./ftstri.so
CREATE VIRTUAL TABLE dict USING fts5(word, tokenize=tri);
Step 3: Insert Data and Perform Queries
Insert data into the virtual table and perform queries using the FTS5 syntax. The custom tokenizer will enable efficient substring searches:
INSERT INTO dict VALUES ('simon');
INSERT INTO dict VALUES ('cleo');
INSERT INTO dict VALUES ('natalie');
-- Search for substrings
SELECT * FROM dict('sim'); -- Returns 'simon'
SELECT * FROM dict('imo'); -- Returns 'simon'
SELECT * FROM dict('mon'); -- Returns 'simon'
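Conceptually, what the FTS5 index is doing here can be modeled in plain Python: the trigram postings narrow the search to candidate rows containing all of the query's trigrams, and a final substring check confirms the match (the verification step is how trigram indexes generally work, e.g. in PostgreSQL). A simplified model, not involving SQLite at all:

```python
def trigrams(text):
    """Set of overlapping three-character sequences in text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(words):
    """Map each trigram to the set of words containing it."""
    index = {}
    for w in words:
        for t in trigrams(w):
            index.setdefault(t, set()).add(w)
    return index

def search(index, words, query):
    """Intersect the posting sets for the query's trigrams, then verify."""
    candidates = set(words)
    for t in trigrams(query):
        candidates &= index.get(t, set())
    return sorted(w for w in candidates if query in w)

words = ["simon", "cleo", "natalie"]
idx = build_index(words)
print(search(idx, words, "imo"))  # ['simon']
```

The intersection step is what the FTS5 index accelerates; the substring verification is cheap because it runs only on the narrowed candidate set.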
Step 4: Use Highlighting for Match Visualization
The FTS5 highlight() function can be used to visualize matches within the text:
SELECT highlight(dict, 0, '(', ')') FROM dict('sim');
-- Returns '(sim)on'
Step 5: Optimize for Larger Datasets
For larger datasets, consider batching inserts within transactions to improve performance. If you instead opt for an alternative schema that materializes suffixes into separate suffix and link tables, make sure those tables are indexed properly to speed up joins and lookups.
By following these steps, you can implement a trigram-like search functionality in SQLite using FTS5 and a custom tokenizer. While this solution requires some initial setup, it provides a powerful way to perform efficient substring searches in SQLite, bridging the gap between its native capabilities and the advanced features found in other databases like PostgreSQL.