Enforcing Exact Word Distance Constraints in SQLite FTS5 Queries
Understanding the Limitations of NEAR in Enforcing Exact Word Distances
SQLite’s FTS5 module provides powerful full-text search capabilities, but its NEAR operator has inherent limitations when users require exact word distance constraints. The NEAR operator allows proximity searches by specifying a maximum allowed distance between two or more terms. For example, NEAR(word1 word2, 10)
matches documents where word1
and word2
appear within 10 tokens of each other. However, this operator does not enforce a minimum distance requirement or guarantee that terms are separated by an exact number of tokens.
The Problem of "False Positives" with NEAR
Consider a scenario where a user wants to find documents where word1
and word2
are exactly 7 tokens apart. Using NEAR(word1 word2, 7)
will include all documents where the words are 1-7 tokens apart, creating "false positives." To filter out matches where the distance is less than 7, a user might try combining NEAR with NOT:
NEAR(word1 word2, 7) NOT NEAR(word1 word2, 6)
This approach subtracts matches where the distance is ≤6. However, it has critical flaws:
- Exclusion of Valid Documents: If a document contains both a valid match (distance=7) and an invalid match (distance≤6), the entire document is excluded.
- Performance Overhead: The query must process two proximity conditions, doubling computational effort.
- Unintuitive Behavior: Users unfamiliar with FTS5’s set-based logic may misinterpret the results.
A real-world example from the SQLite forum demonstrates this:
SELECT * FROM docs WHERE docs MATCH 'NEAR("row" "contains", 1) NOT NEAR("row" "contains", 0)'
This query attempts to find documents where "row" and "contains" are exactly 1 token apart. However, a page containing both "row that contains" (distance=2) and "row contains" (distance=0) is excluded entirely, even though the former is a valid match.
The Absence of an EXACT Operator
FTS5 lacks a dedicated operator like EXACT(word1 word2, N)
to enforce precise distances. This gap forces users to rely on workarounds or auxiliary tools. The discussion highlights a fundamental tension in FTS5’s design: balancing flexibility with precision. While NEAR is sufficient for broad proximity searches, it fails in edge cases requiring mathematical exactness.
Factors Contributing to Inadequate Precision in FTS5 Proximity Searches
1. Design Philosophy of FTS5
FTS5 prioritizes speed and simplicity over granular control. The NEAR operator is optimized for fast lookups using inverted indexes, which track term positions but do not natively support arithmetic operations on token distances. Introducing an EXACT operator would require:
- Storing absolute token positions for all terms.
- Computing pairwise distances during query execution.
- Filtering results dynamically, which degrades performance on large datasets.
2. Backward Compatibility Concerns
Adding new syntax (e.g., EXACT
) risks breaking existing applications that use the same keyword as a literal term. For example, a legal document database containing the phrase "EXACT terms of agreement" would misinterpret queries if EXACT
became a reserved operator. The SQLite team prioritizes backward compatibility, making syntactic changes unlikely unless overwhelmingly justified.
3. Tokenization and Indexing Constraints
FTS5’s tokenization process splits text into tokens based on rules defined by the chosen tokenizer (e.g., unicode61
). The index stores token positions but does not expose them directly to users. Auxiliary functions can access positional data, but this requires custom development. Most users rely on FTS5’s built-in operators, which abstract away low-level details at the cost of flexibility.
4. Use Case Prevalence
The need for exact distances is niche compared to broader proximity searches. Most applications (e.g., search engines, document retrieval) prioritize recall over precision. Until demand for exact matching grows, SQLite is unlikely to prioritize native support.
Implementing Workarounds and Custom Solutions for Exact Distance Queries
Solution 1: Custom Auxiliary Functions
FTS5 allows developers to create auxiliary functions that extend query capabilities. These functions can access token positions and compute distances.
Step-by-Step Implementation
Define a Function to Calculate Distance:
Using SQLite’s C API or a scripting language (e.g., Python), write a function that:- Accepts a column name and search terms.
- Retrieves token positions from the FTS5 index.
- Computes pairwise distances between terms.
Example in Python (using
sqlite3
module):import sqlite3 def word_distance(column, term1, term2): conn = sqlite3.connect(':memory:') cursor = conn.cursor() cursor.execute(f"SELECT positions FROM {column}_stats WHERE term IN (?, ?)", (term1, term2)) positions = cursor.fetchall() # Compute distances from positions data return distances
Integrate the Function into Queries:
Use the auxiliary function in FTS5 queries to filter results:SELECT * FROM docs WHERE docs MATCH 'NEAR(word1 word2, 10)' AND word_distance(docs, 'word1', 'word2') = 7;
Optimize Performance:
- Precompute token positions during indexing.
- Use memoization to cache distances for common term pairs.
Limitations
- Complexity: Requires low-level programming and integration with SQLite.
- Performance: May slow down queries on large datasets.
Solution 2: Combining NEAR and NOT NEAR
While imperfect, this method works for simple cases:
SELECT * FROM docs
WHERE docs MATCH 'NEAR(word1 word2, 10) NOT NEAR(word1 word2, 9)';
This excludes documents where the distance is ≤9, leaving only those with distances ≥10. Adjust the numbers to target specific ranges.
Caveats
- False Negatives: Documents containing both valid and invalid matches are excluded.
- Static Thresholds: Requires manual adjustment for each query.
Solution 3: Custom Syntax Hacks via Column Filters
FTS5’s column filter feature allows users to intercept and modify queries. By "hijacking" undefined column names, you can create pseudo-operators.
Example: Emulating EXACT with a Column Filter
- Define a Filter:
CREATE VIRTUAL TABLE docs USING fts5(content, matchlike='');
- Intercept Queries:
Use a filter to parse custom syntax likematchlike:%ing
and translate it into FTS5-compatible terms. - Encode Distances:
MapEXACT(word1 word2, 7)
to a token likeexact_distance_7
and use a synonym mapping to enforce the constraint.
Limitations
- Fragility: Relies on undocumented behaviors that may change in future SQLite versions.
- Maintenance: Requires ongoing updates to handle new syntax.
Solution 4: Custom Ranking with Auxiliary Functions
Modify the ranking algorithm to prioritize exact matches:
SELECT * FROM docs
WHERE docs MATCH 'word1 word2'
ORDER BY exact_distance_score(docs, 'word1', 'word2', 7) DESC;
Here, exact_distance_score
returns a higher value for documents where word1
and word2
are exactly 7 tokens apart.
Solution 5: Token Position Queries
For advanced users, directly query token positions using FTS5’s hidden __docs__
table:
SELECT * FROM docs
WHERE EXISTS (
SELECT 1 FROM docs_terms
WHERE term = 'word1'
AND ABS(pos - (SELECT pos FROM docs_terms WHERE term = 'word2')) = 7
);
This requires enabling the contentless
option and storing positional data explicitly.
Final Recommendations
- Use Auxiliary Functions for precise control and scalability.
- Combine NEAR/NOT NEAR for quick, ad-hoc queries where false negatives are acceptable.
- Advocate for Native Support by engaging the SQLite community and demonstrating use cases.
By leveraging SQLite’s extensibility, users can overcome FTS5’s limitations while awaiting potential future enhancements.