Ensuring FTS5 Matches Only Contain Specified Search Tokens in SQLite-WASM

Issue Overview: FTS5 Matches Include Rows with Extraneous Tokens Beyond Search Terms

The core challenge arises when using SQLite’s Full-Text Search (FTS5) module to query a virtual table (e.g., user_fts) containing concatenated user data like "Firstname Lastname." The goal is to retrieve rows where at least one search token exists (e.g., "John" OR "Smith") while excluding rows containing tokens not present in the search criteria. For example:

  • Valid Match: "Smith" (contains one search token)
  • Invalid Match: "Johanna Smith" (contains "Smith" but introduces an extraneous token "Johanna")

This requires enforcing two conditions simultaneously:

  1. Inclusion: At least one search token exists in the row
  2. Exclusion: No tokens exist in the row outside the search terms

Standard FTS5 MATCH queries with OR logic satisfy the inclusion requirement but fail to address exclusion. SQLite-WASM builds further complicate this by lacking native regex support or unclear support for custom auxiliary functions, which could otherwise validate token boundaries.

Possible Causes: FTS5 Token Matching Mechanics and Environmental Constraints

1. FTS5 Tokenizer Behavior and OR Queries

FTS5 tokenizes text using rules defined by its tokenizer (typically Unicode61 by default). When using MATCH 'John OR Smith', FTS5 returns rows containing any of the tokens. This creates false positives when rows include additional tokens not in the search phrase. For example, "Johanna Smith" contains "Smith" but also "Johanna," which violates the exclusion requirement.

2. Absence of Regex Support in SQLite-WASM

Regex could theoretically filter out rows with extraneous tokens by validating the entire text against a pattern like ^(John|Smith)*$ (allowing only search tokens). However, SQLite-WASM lacks REGEXP support unless explicitly enabled via loadable extensions, which are often unavailable in browser-based environments.

3. Ambiguity in Custom Auxiliary Function Support

FTS5’s custom auxiliary functions enable advanced match validation by inspecting token positions and counts during queries. However, SQLite-WASM builds may not expose the necessary C APIs (e.g., sqlite3_create_function_v2) for defining these functions, particularly when handling string parameters or complex data structures.

Troubleshooting Steps, Solutions & Fixes: Leveraging FTS5 Internals and Workarounds

Step 1: Confirm Custom Auxiliary Function Availability in SQLite-WASM

Before implementing complex workarounds, verify whether the SQLite-WASM build supports custom auxiliary functions. Execute a test query using the highlight() built-in auxiliary function:

SELECT highlight(user_fts, 0, '[', ']') FROM user_fts WHERE user_fts MATCH 'John OR Smith';

If this works, custom functions are likely supported. If not, proceed to Step 3 for environment-agnostic solutions.


Step 2: Implement Token Validation via Custom Auxiliary Function (If Supported)

If auxiliary functions are available, create a function MyRank to enforce the exclusion requirement using FTS5’s internal token metadata.

How MyRank Works:

  1. Count Total Tokens: Use xColumnSize to retrieve the total token count for the current row.
  2. Identify Matched Token Offsets: Use xInstCount and xInst to get the positions of tokens matching the search query.
  3. Validate Token Coverage: Ensure every token position (0 to totalTokens - 1) is covered by matched offsets. If true, return 1; else, return 0.

Example Implementation (Pseudocode):

#include <sqlite3.h>
#include <sqlite3ext.h>

static void myRankFunc(
  sqlite3_context *pCtx, 
  int nVal, 
  sqlite3_value **apVal
){
  sqlite3_int64 totalTokens = 0;
  sqlite3_int64 matchedTokenCount = 0;
  int rc;

  // Get total tokens in the current row's first column (iCol=0)
  rc = sqlite3_vtab_column_size(pCtx, 0, &totalTokens);
  if (rc != SQLITE_OK) { /* Handle error */ }

  // Get count of matched token instances
  rc = sqlite3_vtab_inst_count(pCtx, &matchedTokenCount);
  if (rc != SQLITE_OK) { /* Handle error */ }

  // Check if all tokens are matched
  int isValid = (matchedTokenCount == totalTokens) ? 1 : 0;
  sqlite3_result_int(pCtx, isValid);
}

// Register function during FTS5 module initialization
sqlite3_create_function_v2(
  db, 
  "MyRank", 
  1, 
  SQLITE_UTF8, 
  nullptr, 
  myRankFunc, 
  nullptr, 
  nullptr
);

Final Query:

SELECT rowid 
FROM user_fts 
WHERE 
  user_fts MATCH 'John OR Smith' 
  AND MyRank(user_fts) = 1;

Step 3: Environment-Agnostic Solutions Without Custom Functions

If custom functions are unavailable, use FTS5’s built-in capabilities and SQL logic to approximate the exclusion requirement.

Solution 3A: Exploit Phrase Matching and Column Constraints

Force exact phrase matches for multi-token searches while allowing single-token flexibility:

-- Match "John Smith" as exact phrase
SELECT rowid FROM user_fts WHERE user_fts MATCH '"John Smith"'

UNION

-- Match individual tokens but enforce single-token rows
SELECT rowid FROM user_fts 
WHERE 
  (user_fts MATCH 'John' OR user_fts MATCH 'Smith')
  AND length(trim([column])) - length(replace(trim([column]), ' ', '')) = 0;

Limitations: Assumes single-column FTS tables and uses string manipulation to count spaces (not reliable for all tokenizers).

Solution 3B: Precompute Token Counts and Validate Post-Query

  1. Add a token_count column to the source table, storing the number of tokens in "Firstname Lastname."
  2. Use FTS5’s content option to sync with the source table.
  3. Combine FTS matches with token count validation:
-- For search terms "John Smith" (2 tokens)
SELECT fts.rowid 
FROM user_fts fts
JOIN source_table src ON fts.rowid = src.rowid
WHERE 
  fts.user_fts MATCH 'John OR Smith'
  AND src.token_count <= 2; -- Ensure row doesn’t exceed search term count

Limitations: Fails if rows contain fewer tokens than search terms but include invalid tokens (e.g., "John Doe Smith" with search terms "John Smith").

Solution 3C: Hybrid FTS5 and JavaScript Token Validation

In a WebAssembly context, retrieve FTS matches client-side and validate tokens using JavaScript:

// Execute initial FTS query
const rows = await db.exec({
  sql: 'SELECT rowid, * FROM user_fts WHERE user_fts MATCH ?',
  bind: ['John OR Smith'],
});

// Validate tokens client-side
const validRows = rows.filter(row => {
  const tokens = row.columnName.split(/\s+/); // Adjust for FTS5 tokenizer
  const searchTokens = new Set(['john', 'smith']);
  return tokens.every(t => searchTokens.has(t.toLowerCase()));
});

Limitations: Requires transferring all FTS matches to the client, which may be inefficient for large datasets.


Final Recommendations

  1. Prioritize Custom Auxiliary Functions if supported in SQLite-WASM, as they provide the most robust solution.
  2. Use Hybrid Client-Side Validation as a fallback for environments with restricted SQL capabilities.
  3. Precompute Token Metadata (e.g., counts, allowed tokens) during data ingestion to simplify runtime queries.

By combining FTS5’s native strengths with strategic validations—either server-side via auxiliary functions or client-side via post-processing—developers can enforce strict token inclusion/exclusion rules despite SQLite-WASM’s constraints.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *