Enabling Custom Tokenizers to Support LIKE and GLOB Patterns in SQLite FTS5
The Current Limitation of FTS5 Tokenizers in Supporting LIKE and GLOB Patterns
The Full-Text Search version 5 (FTS5) extension in SQLite indexes text by passing it through a tokenizer, which breaks the text into smaller units (tokens) for indexing and querying. A significant limitation exists in the current implementation: only tables that use the built-in trigram tokenizer can use the full-text index to serve the LIKE and GLOB pattern-matching operators. This restriction is hardcoded in the FTS5 source, in the file fts5_tokenize.c, as the following snippet shows:
    int sqlite3Fts5TokenizerPattern(
      int (*xCreate)(void*, const char**, int, Fts5Tokenizer**),
      Fts5Tokenizer *pTok
    ){
      if( xCreate==fts5TriCreate ){
        TrigramTokenizer *p = (TrigramTokenizer*)pTok;
        if( p->iFoldParam==0 ){
          return p->bFold ? FTS5_PATTERN_LIKE : FTS5_PATTERN_GLOB;
        }
      }
      return FTS5_PATTERN_NONE;
    }
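This code checks whether the tokenizer in use was created by fts5TriCreate, that is, whether it is the trigram tokenizer. If it is, and iFoldParam is zero, the function returns FTS5_PATTERN_LIKE when the tokenizer folds case (bFold is set) or FTS5_PATTERN_GLOB when it does not. This mirrors the operators themselves: LIKE is case-insensitive by default in SQLite, while GLOB is case-sensitive. In every other case, including every other tokenizer, the function returns FTS5_PATTERN_NONE, which disables index support for LIKE and GLOB.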
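To make the three possible outcomes concrete, here is a minimal, self-contained model of that decision. It is a sketch, not the FTS5 source: PatternKind, TrigramOpts, and trigram_pattern are names invented for illustration, and the enum values are stand-ins for the internal FTS5_PATTERN_* constants. The comments that tie the two fields to the case_sensitive and remove_diacritics tokenizer options are an assumption based on a reading of the trigram tokenizer, not something the snippet above states.

```c
/* Illustrative stand-ins for FTS5's internal FTS5_PATTERN_* values. */
typedef enum { PATTERN_NONE, PATTERN_LIKE, PATTERN_GLOB } PatternKind;

/* Hypothetical mirror of the two TrigramTokenizer fields used above. */
typedef struct {
  int bFold;       /* non-zero: tokens are case-folded (assumed: case_sensitive 0) */
  int iFoldParam;  /* non-zero: extra folding applied (assumed: remove_diacritics) */
} TrigramOpts;

/* Same decision as the snippet: a non-zero iFoldParam rules out both
** operators; otherwise case folding selects LIKE and exact case GLOB. */
static PatternKind trigram_pattern(const TrigramOpts *p){
  if( p->iFoldParam!=0 ) return PATTERN_NONE;
  return p->bFold ? PATTERN_LIKE : PATTERN_GLOB;
}
```

Under these assumptions, a case-folding configuration maps to LIKE, a case-sensitive one to GLOB, and any configuration with a non-zero iFoldParam to neither.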
This limitation is a problem for developers who want to write custom tokenizers that also support LIKE and GLOB patterns. For example, a developer might want a tokenizer that handles text slightly differently from the trigram tokenizer but still supports these pattern-matching operators. Currently, there is no way to achieve this without modifying and recompiling the FTS5 source code.
As a result, developers must either stay with the trigram tokenizer or give up LIKE and GLOB patterns altogether. This is particularly limiting when an application needs both custom tokenization logic and pattern-matching queries.
Why Custom Tokenizers Cannot Currently Support LIKE and GLOB Patterns
The core issue lies in how FTS5 connects tokenizers to the pattern-matching operators. The current implementation assumes that only the trigram tokenizer can support LIKE and GLOB patterns. This assumption is hardcoded in the sqlite3Fts5TokenizerPattern function, which compares the tokenizer's constructor against fts5TriCreate and returns FTS5_PATTERN_NONE for all other tokenizers.
One possible reason for this design decision is that the trigram tokenizer is built for substring matching. It breaks text into overlapping three-character sequences (trigrams), so the literal parts of a LIKE or GLOB pattern can be looked up in the index as trigrams. Other tokenizers, such as unicode61, split text into words, so a pattern that matches only part of a word has no corresponding token in the index.
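The link between trigrams and substring patterns is easier to see with a small example. The sketch below splits a string into overlapping three-byte windows. It is a deliberate simplification: it works on bytes and so only suits ASCII input, whereas the real tokenizer operates on Unicode characters, and the function name ascii_trigrams is invented for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Copy up to maxOut three-byte windows of zText into out, each one
** NUL-terminated. Returns the number of trigrams in zText, which may
** be larger than maxOut. Strings shorter than 3 bytes yield none. */
static size_t ascii_trigrams(const char *zText, char out[][4], size_t maxOut){
  size_t n = strlen(zText);
  size_t nTri = n>=3 ? n-2 : 0;
  size_t i;
  for(i=0; i<nTri && i<maxOut; i++){
    memcpy(out[i], &zText[i], 3);
    out[i][3] = '\0';
  }
  return nTri;
}
```

For the input "sqlite" this produces sql, qli, lit, and ite. A pattern whose literal part is "qlit" can only match rows that contain both qli and lit, so an index keyed on trigrams can narrow the candidate rows before the full pattern is checked. A literal run shorter than three characters produces no trigrams and gives the index nothing to look up.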
However, this design does not account for the possibility that a custom tokenizer might be able to support LIKE and GLOB patterns without using the same trigram-based approach. For example, a custom tokenizer might implement its own logic for breaking text into tokens that are compatible with pattern-matching queries. In such cases, FTS5 currently provides no mechanism for the tokenizer to indicate that it supports these patterns.
Another contributing factor is the lack of an extension point in the tokenizer interface (the fts5_tokenizer structure, with its xCreate, xDelete, and xTokenize methods) through which a custom tokenizer could declare support for LIKE and GLOB patterns. The sqlite3Fts5TokenizerPattern function is the only place in the FTS5 codebase where this capability is determined, and it gives custom tokenizers no way to "opt in" to this behavior.
This design locks custom tokenizers out of LIKE and GLOB support even when they are technically capable of providing it. Developers are left to either modify the FTS5 source code or abandon their custom tokenization logic in favor of the trigram tokenizer.
How to Enable Custom Tokenizers to Support LIKE and GLOB Patterns
To address this limitation, FTS5 would need to be modified so that custom tokenizers can indicate their support for LIKE and GLOB patterns. This can be achieved by adding an extension point to the tokenizer interface through which a tokenizer declares its pattern-matching capabilities.
One possible approach is to add an optional function to the tokenizer interface, here called xPatternSupport (a hypothetical name; no such method exists in FTS5 today), that reports whether the tokenizer supports LIKE, GLOB, or neither. The sqlite3Fts5TokenizerPattern function could then call it to determine the tokenizer's pattern-matching capabilities. For example:
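    int sqlite3Fts5TokenizerPattern(
      int (*xCreate)(void*, const char**, int, Fts5Tokenizer**),
      int (*xPatternSupport)(Fts5Tokenizer*),  /* Proposed hook; NULL if absent */
      Fts5Tokenizer *pTok
    ){
      /* Existing behavior for the built-in trigram tokenizer. */
      if( xCreate==fts5TriCreate ){
        TrigramTokenizer *p = (TrigramTokenizer*)pTok;
        if( p->iFoldParam==0 ){
          return p->bFold ? FTS5_PATTERN_LIKE : FTS5_PATTERN_GLOB;
        }
      }
      /* Fts5Tokenizer is an opaque type, so the hook is passed in from the
      ** tokenizer's method table rather than read from pTok itself. */
      if( xPatternSupport ){
        return xPatternSupport(pTok);
      }
      return FTS5_PATTERN_NONE;
    }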
In this modified version of the sqlite3Fts5TokenizerPattern function, the trigram check is unchanged, and a tokenizer that provides an xPatternSupport function is asked for its pattern-matching capabilities. Once this hook exists in FTS5, custom tokenizers can declare support for LIKE and GLOB patterns without any further changes to the FTS5 source code.
To implement this functionality in a custom tokenizer, the developer would need to define the xPatternSupport function and return the appropriate value (FTS5_PATTERN_LIKE, FTS5_PATTERN_GLOB, or FTS5_PATTERN_NONE) based on the tokenizer’s capabilities. For example:
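    /* MyTokenizer and its fields are hypothetical, for illustration only. */
    static int myTokenizerPatternSupport(Fts5Tokenizer *pTok){
      MyTokenizer *p = (MyTokenizer*)pTok;
      if( !p->bIndexesSubstrings ){
        /* Tokens cannot answer substring patterns: no index support. */
        return FTS5_PATTERN_NONE;
      }else if( p->bFold ){
        /* Case-folded tokens match LIKE's case-insensitive semantics. */
        return FTS5_PATTERN_LIKE;
      }else{
        /* Case-preserving tokens match GLOB's case-sensitive semantics. */
        return FTS5_PATTERN_GLOB;
      }
    }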
This approach gives custom tokenizers a flexible and extensible way to support LIKE and GLOB patterns: after a one-time change to FTS5 itself, no tokenizer author would need to patch the core codebase. It also maintains backward compatibility, because existing tokenizers that do not provide the xPatternSupport function continue to work as before.
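The opt-in behavior and the backward-compatibility claim can be checked in isolation with a small mock. The sketch below does not use the real FTS5 types: TokenizerMethods, MyTokenizer, my_pattern_support, and tokenizer_pattern are all invented for illustration, the tokenizer instance is passed as void* to keep the example self-contained, and the DEMO_PATTERN_* values stand in for the FTS5_PATTERN_* constants, which appear to be internal to FTS5 and would presumably need to be exposed for a real hook.

```c
#include <stddef.h>

/* Illustrative stand-ins for FTS5's internal FTS5_PATTERN_* values. */
enum { DEMO_PATTERN_NONE = 0, DEMO_PATTERN_LIKE = 1, DEMO_PATTERN_GLOB = 2 };

/* Hypothetical method table: xPatternSupport is the proposed optional hook. */
typedef struct TokenizerMethods {
  int (*xPatternSupport)(void *pTok);  /* NULL when the tokenizer does not opt in */
} TokenizerMethods;

/* Hypothetical custom tokenizer instance. */
typedef struct MyTokenizer {
  int bFold;  /* non-zero if tokens are case-folded */
} MyTokenizer;

/* Tokenizer-side hook: folded tokens suit LIKE, exact-case tokens suit GLOB. */
static int my_pattern_support(void *pTok){
  const MyTokenizer *p = (const MyTokenizer*)pTok;
  return p->bFold ? DEMO_PATTERN_LIKE : DEMO_PATTERN_GLOB;
}

/* Core-side dispatch: ask the hook if present, otherwise report no support. */
static int tokenizer_pattern(const TokenizerMethods *pApi, void *pTok){
  if( pApi->xPatternSupport ) return pApi->xPatternSupport(pTok);
  return DEMO_PATTERN_NONE;
}
```

A method table whose hook is NULL behaves exactly as tokenizers do today and reports no pattern support, while a table that sets the hook has its answer passed through unchanged.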
In addition to this, the SQLite documentation should be updated to include guidelines for implementing custom tokenizers that support LIKE and GLOB patterns. This would help developers understand how to take advantage of this new functionality and ensure that their custom tokenizers are compatible with FTS5’s pattern-matching features.
With these changes, SQLite would give developers the flexibility to implement custom tokenizers that use the full capabilities of FTS5, including support for LIKE and GLOB patterns. This would make FTS5 considerably more usable in scenarios that require custom tokenization logic, without changing the behavior of existing tokenizers.
This post provides a detailed analysis of the issue, its underlying causes, and a proposed solution for enabling custom tokenizers to support LIKE and GLOB patterns in SQLite FTS5. By addressing this limitation, SQLite can become an even more powerful tool for developers who need advanced text search capabilities in their applications.