Exponential Insertion Time in FTS5 with Recursive CTE Triggers

Understanding the Problem

The core issue is an exponential increase in insertion time when SQLite's FTS5 (full-text search) virtual tables are combined with recursive Common Table Expressions (CTEs) inside triggers. The scenario involves a table bdata that stores data extracted from a JSON file, with three columns (cn, pn, and sched) requiring full-text search. The goal is to enable substring searches across these columns, so users can filter results on any column.

To achieve this, the user implemented a schema where:

  1. A base table bdata stores the JSON data.
  2. Three FTS5 virtual tables (cn_fts, pn_fts, and sched_fts) are created to handle full-text searches for each column.
  3. Triggers are set up to populate these FTS tables after each insertion into bdata. These triggers use recursive CTEs to generate all possible substrings for each column value.
  4. An auxiliary table fts_aux is used to filter out duplicate (id, text) pairs before inserting the final results into the FTS tables.

The problem arises when testing this setup. Insertion times increase exponentially as more triggers are added:

  • Without triggers: 15 milliseconds.
  • With one trigger: 15 milliseconds.
  • With two triggers: 130 milliseconds.
  • With three triggers: 2894 milliseconds.

This exponential growth in insertion time makes the solution impractical for real-world use cases, especially when dealing with thousands of records.
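Timings like these are straightforward to reproduce with a small harness. The sketch below (in Python, with a deliberately minimal stand-in schema; the real tables would carry the triggers and FTS indexes) times a batch of single-row inserts so different trigger configurations can be compared:

```python
import sqlite3
import time

def time_inserts(setup_sql, n_rows=500):
    """Create an in-memory database, apply schema/trigger SQL,
    then time n_rows single-row inserts inside one transaction."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)
    start = time.perf_counter()
    with conn:
        for i in range(n_rows):
            conn.execute(
                "INSERT INTO bdata (cn, pn, sched) VALUES (?, ?, ?)",
                (f"cn-{i}", f"pn-{i}", f"sched-{i}"),
            )
    return time.perf_counter() - start

# Hypothetical minimal schema; append trigger/FTS DDL to compare configurations.
schema = "CREATE TABLE bdata (id INTEGER PRIMARY KEY, cn TEXT, pn TEXT, sched TEXT);"
print(f"{time_inserts(schema) * 1000:.1f} ms")
```

Running the same harness with one, two, and three triggers appended to the schema makes the growth curve visible without any application code.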

Possible Causes of Exponential Insertion Time

The exponential insertion time can be attributed to several factors:

  1. Recursive CTE Overhead: Recursive CTEs are computationally expensive when used to generate every substring of a column value. A string of length n has n(n+1)/2 substrings, so the work per row grows quadratically with string length, and every trigger pays that cost again.

  2. Redundant Trigger Execution: Each trigger runs independently, so the recursive CTE is executed once per trigger for every insertion. Three triggers mean three full substring expansions per row.

  3. Auxiliary Table Operations: The use of the fts_aux table to filter duplicates introduces additional I/O operations. While the UNIQUE constraint ensures no duplicates are inserted, the process of inserting into and deleting from this table adds significant overhead.

  4. FTS5 Indexing: FTS5 virtual tables are optimized for full-text search but are not designed for high-frequency, high-volume insertions, especially when combined with complex preprocessing logic like recursive CTEs.

  5. Data Characteristics: The length and complexity of the strings in the cn, pn, and sched columns also contribute to the problem. Longer strings result in more recursive calls and more substrings to process.

  6. Tokenization Mismatch: The default FTS5 tokenizer may not be well-suited for the specific requirements of substring searches. The tokenizer is designed to break text into words based on alphanumeric characters, which may not align with the need to search for arbitrary substrings.
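The blow-up described in item 1 is easy to quantify: a string of length n has n(n+1)/2 substrings, so doubling the column length roughly quadruples the number of rows each CTE produces. A quick check:

```python
def count_substrings(n, min_length=1):
    """Number of substrings (counted with multiplicity) of length
    at least min_length in a string of n characters."""
    return sum(n - length + 1 for length in range(min_length, n + 1))

for n in (10, 20, 40, 80):
    print(n, count_substrings(n))
# 10 -> 55, 20 -> 210, 40 -> 820, 80 -> 3240: quadratic growth.
```

Even a modest minimum-length floor (see fix 1 below) only trims the constant factor; the quadratic shape remains.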

Troubleshooting Steps, Solutions, and Fixes

To address the exponential insertion time, several strategies can be employed:

1. Optimize the Recursive CTE

The recursive CTE is the primary bottleneck. To reduce its overhead:

  • Limit Substring Generation: Instead of generating all possible substrings, generate only those that are necessary for the search. For example, if the search only requires substrings of a minimum length, modify the CTE to stop recursion once the substring length falls below this threshold.
  • Avoid Redundant Recursion: The current CTE generates substrings from both the left and right sides of the string. If only one direction is needed, eliminate the redundant recursion branch.

Example:

WITH RECURSIVE str(cn) AS (
  SELECT NEW.cn
  UNION ALL
  SELECT substr(cn, 2) FROM str WHERE length(cn) > 3
)
SELECT NEW.id, cn FROM str;
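To make the CTE's output concrete, the loop below mirrors it in Python: starting from the full value, it repeatedly drops the first character while more than three characters remain. Note that this yields suffixes only, not all substrings; the three-character floor matches the WHERE clause above.

```python
def suffixes(value, min_length=3):
    """Mirror of the recursive CTE: emit the value, then keep
    dropping the first character while more than min_length remain."""
    out = [value]
    while len(out[-1]) > min_length:
        out.append(out[-1][1:])
    return out

print(suffixes("abcdef"))  # ['abcdef', 'bcdef', 'cdef', 'def']
```

This produces n rows instead of roughly n²/2, which is exactly the saving the "avoid redundant recursion" bullet is after.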

2. Consolidate Triggers

Instead of having a separate trigger for each FTS table, consolidate them into a single trigger. Each FTS table still needs substrings generated from its own column, so the CTE runs once per column either way, but a single trigger reduces dispatch overhead and keeps the fts_aux bookkeeping in one place.

Example:

CREATE TRIGGER IF NOT EXISTS insert_fts AFTER INSERT ON bdata
BEGIN
  INSERT OR IGNORE INTO fts_aux (id, txt)
  WITH RECURSIVE str(txt) AS (
    SELECT NEW.cn
    UNION ALL
    SELECT substr(txt, 2) FROM str WHERE length(txt) > 3
  )
  SELECT NEW.id, txt FROM str;
  INSERT INTO cn_fts (id, cn) SELECT id, txt FROM fts_aux;
  DELETE FROM fts_aux;

  -- Repeat the same CTE / insert / delete cycle for pn -> pn_fts
  -- and sched -> sched_fts, so that each FTS table receives only
  -- substrings of its own column.
END;

3. Use a More Efficient Tokenizer

The default FTS5 tokenizer may not be ideal for substring searches. Consider using the trigram tokenizer, which is designed for substring matching.

Example:

CREATE VIRTUAL TABLE bdata_fts USING fts5(cn, pn, sched, tokenize=trigram);

The trigram tokenizer breaks text into three-character sequences, enabling efficient substring matching (it can also accelerate LIKE and GLOB queries). It requires SQLite 3.34 or later, and it only supports search terms of three or more characters.
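A minimal round trip (a sketch using Python's stdlib sqlite3, assuming the bundled SQLite is 3.34+ so the trigram tokenizer is available) shows why this fits the use case: after indexing, a plain MATCH on any fragment of three or more characters finds the row, with no substring-generating triggers at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE bdata_fts USING fts5(cn, pn, sched, tokenize='trigram')"
)
conn.execute(
    "INSERT INTO bdata_fts (cn, pn, sched) VALUES ('alpha-7', 'PN-1234', 'weekly')"
)
# '234' is an arbitrary interior fragment of the pn value.
rows = conn.execute(
    "SELECT cn FROM bdata_fts WHERE bdata_fts MATCH '\"234\"'"
).fetchall()
print(rows)  # [('alpha-7',)]
```

The fragment is quoted inside the MATCH expression so that FTS5 treats it as a literal string rather than query syntax.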

4. Implement Contentless FTS5 Tables

To avoid duplicating text between bdata and the FTS tables, use contentless (content='') or external-content (content='bdata') FTS5 tables. Both store only the search index, reducing storage overhead and potentially improving insertion performance. The difference matters for queries: a contentless table cannot return column values (only rowids), while an external-content table reads them back from the base table on demand.

Example:

CREATE VIRTUAL TABLE bdata_fts USING fts5(cn, pn, sched, content='', tokenize=trigram);
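Of the two variants, an external-content table is usually the better fit here, since the text already lives in bdata. A sketch of the documented FTS5 pattern (column names follow the article's schema; the delete/update sync triggers are omitted for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bdata (id INTEGER PRIMARY KEY, cn TEXT, pn TEXT, sched TEXT);
-- External-content table: the index references bdata instead of storing copies.
CREATE VIRTUAL TABLE bdata_fts USING fts5(
    cn, pn, sched, content='bdata', content_rowid='id', tokenize='trigram'
);
-- Keep the index in sync on insert.
CREATE TRIGGER bdata_ai AFTER INSERT ON bdata BEGIN
    INSERT INTO bdata_fts (rowid, cn, pn, sched)
    VALUES (new.id, new.cn, new.pn, new.sched);
END;
""")
conn.execute(
    "INSERT INTO bdata (cn, pn, sched) VALUES ('alpha-7', 'PN-1234', 'weekly')"
)
rows = conn.execute(
    "SELECT cn FROM bdata_fts WHERE bdata_fts MATCH '\"eekl\"'"
).fetchall()
print(rows)  # [('alpha-7',)] -- cn is read back from bdata, not stored twice
```

Note that the sync trigger here is a single plain INSERT per row, which is far cheaper than the recursive-CTE triggers that caused the original slowdown.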

5. Batch Inserts

If the JSON file contains thousands of records, batch the inserts: instead of committing one record at a time, insert many records inside a single transaction. The triggers still fire per row, but the per-transaction costs (journal syncs, fsync calls) are paid once for the whole batch.

Example:

BEGIN TRANSACTION;
INSERT INTO bdata (cn, pn, sched) VALUES (...), (...), ...;
COMMIT;
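In Python the same idea is executemany inside one transaction; a sketch against a minimal stand-in bdata table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bdata (id INTEGER PRIMARY KEY, cn TEXT, pn TEXT, sched TEXT)"
)

records = [(f"cn-{i}", f"pn-{i}", f"sched-{i}") for i in range(1000)]

# One transaction for the whole batch: any triggers still run per row,
# but journal syncs and commit overhead are paid once.
with conn:
    conn.executemany(
        "INSERT INTO bdata (cn, pn, sched) VALUES (?, ?, ?)", records
    )

count = conn.execute("SELECT count(*) FROM bdata").fetchone()[0]
print(count)  # 1000
```

The `with conn:` block opens a transaction and commits it on success, matching the BEGIN/COMMIT pair above.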

6. Preprocess Data Outside SQLite

If the recursive CTE overhead remains prohibitive, consider preprocessing the data outside SQLite. Use a scripting language (e.g., Python) to generate the necessary substrings and insert them directly into the FTS tables.

Example (Python):

import json
import sqlite3

def generate_substrings(s, min_length=3):
    """All distinct substrings of s with at least min_length characters.
    Returning a set deduplicates; FTS5 itself enforces no uniqueness."""
    return {s[i:i + length]
            for length in range(min_length, len(s) + 1)
            for i in range(len(s) - length + 1)}

conn = sqlite3.connect('database.db')
cursor = conn.cursor()

# Load the source records; 'data.json' stands in for the real input file.
with open('data.json') as f:
    json_data = json.load(f)

# Generate substrings in Python and insert them directly, bypassing the triggers.
for record in json_data:
    for column, table in (('cn', 'cn_fts'), ('pn', 'pn_fts'), ('sched', 'sched_fts')):
        substrings = generate_substrings(record[column])
        cursor.executemany(
            f'INSERT INTO {table} (id, {column}) VALUES (?, ?)',
            [(record['id'], s) for s in substrings])

conn.commit()
conn.close()

7. Evaluate Alternative Solutions

If FTS5 proves unsuitable for the specific requirements, consider alternative solutions:

  • Regular Expressions: Use SQLite’s REGEXP extension (if available) for substring searches.
  • Custom Indexing: Implement a custom indexing scheme tailored to the specific search requirements.
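On the first alternative: SQLite's REGEXP operator is only a hook. `x REGEXP y` is rewritten as a call to a user function regexp(y, x), which most builds do not ship; from Python it can be supplied in one line via create_function (a sketch; the table and pattern are illustrative):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite translates `x REGEXP y` into regexp(y, x); register that function.
conn.create_function(
    "regexp", 2, lambda pattern, value: re.search(pattern, value) is not None
)

conn.execute("CREATE TABLE bdata (cn TEXT)")
conn.executemany(
    "INSERT INTO bdata (cn) VALUES (?)", [("alpha-7",), ("beta-9",)]
)

rows = conn.execute(
    "SELECT cn FROM bdata WHERE cn REGEXP 'a-[0-9]$' ORDER BY cn"
).fetchall()
print(rows)  # [('alpha-7',), ('beta-9',)]
```

Unlike FTS5, a REGEXP (or plain LIKE) filter performs a full table scan, so it trades insert-time cost for query-time cost; that trade-off can still win when inserts vastly outnumber searches.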

Conclusion

The exponential insertion time in FTS5 with recursive CTE triggers is a complex issue with multiple contributing factors. By optimizing the recursive CTE, consolidating triggers, using a more efficient tokenizer, and considering alternative solutions, it is possible to significantly improve performance. Each of these strategies addresses a different aspect of the problem, and a combination of them may be necessary to achieve the desired results.
