Enforcing Exact Word Distance Constraints in SQLite FTS5 Queries


Understanding the Limitations of NEAR in Enforcing Exact Word Distances

SQLite’s FTS5 module provides powerful full-text search capabilities, but its NEAR operator has inherent limitations when users require exact word distance constraints. The NEAR operator allows proximity searches by specifying a maximum allowed distance between two or more terms. For example, NEAR(word1 word2, 10) matches documents where word1 and word2 appear within 10 tokens of each other. However, this operator does not enforce a minimum distance requirement or guarantee that terms are separated by an exact number of tokens.

The Problem of "False Positives" with NEAR

Consider a scenario where a user wants to find documents where word1 and word2 are exactly 7 tokens apart. Using NEAR(word1 word2, 7) will include all documents where the words are 1-7 tokens apart, creating "false positives." To filter out matches where the distance is less than 7, a user might try combining NEAR with NOT:

NEAR(word1 word2, 7) NOT NEAR(word1 word2, 6)

This approach subtracts matches where the distance is ≤6. However, it has critical flaws:

  1. Exclusion of Valid Documents: If a document contains both a valid match (distance=7) and an invalid match (distance≤6), the entire document is excluded.
  2. Performance Overhead: The query must process two proximity conditions, doubling computational effort.
  3. Unintuitive Behavior: Users unfamiliar with FTS5’s set-based logic may misinterpret the results.

A real-world example from the SQLite forum demonstrates this:

SELECT * FROM docs WHERE docs MATCH 'NEAR("row" "contains", 1) NOT NEAR("row" "contains", 0)'  

This query attempts to find documents where "row" and "contains" are exactly 1 token apart. However, a page containing both "row that contains" (distance=2) and "row contains" (distance=0) is excluded entirely, even though the former is a valid match.

The Absence of an EXACT Operator

FTS5 lacks a dedicated operator like EXACT(word1 word2, N) to enforce precise distances. This gap forces users to rely on workarounds or auxiliary tools. The discussion highlights a fundamental tension in FTS5’s design: balancing flexibility with precision. While NEAR is sufficient for broad proximity searches, it fails in edge cases requiring mathematical exactness.


Factors Contributing to Inadequate Precision in FTS5 Proximity Searches

1. Design Philosophy of FTS5

FTS5 prioritizes speed and simplicity over granular control. The NEAR operator is optimized for fast lookups using inverted indexes, which track term positions but do not natively support arithmetic operations on token distances. Introducing an EXACT operator would require:

  • Storing absolute token positions for all terms.
  • Computing pairwise distances during query execution.
  • Filtering results dynamically, which degrades performance on large datasets.

2. Backward Compatibility Concerns

Adding new syntax (e.g., EXACT) risks breaking existing applications that use the same keyword as a literal term. For example, a legal document database containing the phrase "EXACT terms of agreement" would misinterpret queries if EXACT became a reserved operator. The SQLite team prioritizes backward compatibility, making syntactic changes unlikely unless overwhelmingly justified.

3. Tokenization and Indexing Constraints

FTS5’s tokenization process splits text into tokens based on rules defined by the chosen tokenizer (e.g., unicode61). The index stores token positions but does not expose them directly to users. Auxiliary functions can access positional data, but this requires custom development. Most users rely on FTS5’s built-in operators, which abstract away low-level details at the cost of flexibility.

4. Use Case Prevalence

The need for exact distances is niche compared to broader proximity searches. Most applications (e.g., search engines, document retrieval) prioritize recall over precision. Until demand for exact matching grows, SQLite is unlikely to prioritize native support.


Implementing Workarounds and Custom Solutions for Exact Distance Queries

Solution 1: Custom Auxiliary Functions

FTS5 allows developers to create auxiliary functions that extend query capabilities. These functions can access token positions and compute distances.

Step-by-Step Implementation

  1. Define a Function to Calculate Distance:
    Using SQLite’s C API or a scripting language (e.g., Python), write a function that:

    • Accepts a column name and search terms.
    • Retrieves token positions from the FTS5 index.
    • Computes pairwise distances between terms.

    Example in Python (using sqlite3 module):

    import sqlite3
    
    def word_distance(column, term1, term2):
        conn = sqlite3.connect(':memory:')
        cursor = conn.cursor()
        cursor.execute(f"SELECT positions FROM {column}_stats WHERE term IN (?, ?)", (term1, term2))
        positions = cursor.fetchall()
        # Compute distances from positions data
        return distances
    
  2. Integrate the Function into Queries:
    Use the auxiliary function in FTS5 queries to filter results:

    SELECT * FROM docs 
    WHERE docs MATCH 'NEAR(word1 word2, 10)' 
    AND word_distance(docs, 'word1', 'word2') = 7;
    
  3. Optimize Performance:

    • Precompute token positions during indexing.
    • Use memoization to cache distances for common term pairs.

Limitations

  • Complexity: Requires low-level programming and integration with SQLite.
  • Performance: May slow down queries on large datasets.

Solution 2: Combining NEAR and NOT NEAR

While imperfect, this method works for simple cases:

SELECT * FROM docs 
WHERE docs MATCH 'NEAR(word1 word2, 10) NOT NEAR(word1 word2, 9)';

This excludes documents where the distance is ≤9, leaving only those with distances ≥10. Adjust the numbers to target specific ranges.

Caveats

  • False Negatives: Documents containing both valid and invalid matches are excluded.
  • Static Thresholds: Requires manual adjustment for each query.

Solution 3: Custom Syntax Hacks via Column Filters

FTS5’s column filter feature allows users to intercept and modify queries. By "hijacking" undefined column names, you can create pseudo-operators.

Example: Emulating EXACT with a Column Filter

  1. Define a Filter:
    CREATE VIRTUAL TABLE docs USING fts5(content, matchlike='');
    
  2. Intercept Queries:
    Use a filter to parse custom syntax like matchlike:%ing and translate it into FTS5-compatible terms.
  3. Encode Distances:
    Map EXACT(word1 word2, 7) to a token like exact_distance_7 and use a synonym mapping to enforce the constraint.

Limitations

  • Fragility: Relies on undocumented behaviors that may change in future SQLite versions.
  • Maintenance: Requires ongoing updates to handle new syntax.

Solution 4: Custom Ranking with Auxiliary Functions

Modify the ranking algorithm to prioritize exact matches:

SELECT * FROM docs 
WHERE docs MATCH 'word1 word2' 
ORDER BY exact_distance_score(docs, 'word1', 'word2', 7) DESC;

Here, exact_distance_score returns a higher value for documents where word1 and word2 are exactly 7 tokens apart.

Solution 5: Token Position Queries

For advanced users, directly query token positions using FTS5’s hidden __docs__ table:

SELECT * FROM docs 
WHERE EXISTS (
    SELECT 1 FROM docs_terms 
    WHERE term = 'word1' 
    AND ABS(pos - (SELECT pos FROM docs_terms WHERE term = 'word2')) = 7
);

This requires enabling the contentless option and storing positional data explicitly.


Final Recommendations

  1. Use Auxiliary Functions for precise control and scalability.
  2. Combine NEAR/NOT NEAR for quick, ad-hoc queries where false negatives are acceptable.
  3. Advocate for Native Support by engaging the SQLite community and demonstrating use cases.

By leveraging SQLite’s extensibility, users can overcome FTS5’s limitations while awaiting potential future enhancements.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *