FTS5 Trigger Misuse and External Content Indexing Issues in SQLite

Issue Overview: FTS5 Trigger Misuse and External Content Indexing

The core issue revolves around the misuse of SQLite’s FTS5 (Full-Text Search) extension, specifically in the context of managing a wiki-like notebook application. The goal is to maintain a pages table where each page revision is stored, and only the most recent revision of each page should be indexed in the FTS5 table for search purposes. The user attempted to achieve this by using triggers to manage the FTS5 index, but the implementation led to unexpected results, including the indexing of all revisions instead of just the most recent ones.

The user’s initial approach involved creating a pages table to store page revisions, with each revision having a title, a timestamp, and a body. The FTS5 table, pages_fts, was intended to index only the most recent revision of each page. Triggers were set up to manage the FTS5 index: a BEFORE INSERT trigger to delete the old entry from the FTS5 table, an AFTER INSERT trigger to insert the new entry, and an AFTER DELETE trigger to remove entries from the FTS5 table when a page is deleted.

However, the implementation did not work as expected. The FTS5 table ended up indexing all revisions of the pages, leading to multiple search results for the same page title. The user also encountered issues with the integrity check, which failed due to inconsistencies between the pages table and the pages_fts table.

Possible Causes: Misunderstanding FTS5 External Content Indexing

The primary cause of the issue lies in the misunderstanding of how FTS5’s external content indexing works. When using the content and content_rowid options in FTS5, the FTS5 table is tightly coupled with the external content table (pages in this case). This coupling means that the FTS5 table must always reflect the exact state of the external content table. Any attempt to selectively index rows from the external content table will result in inconsistencies and potential database corruption.

In the user’s initial setup, the FTS5 table was configured with an explicit id column, which was intended to map to the id column in the pages table. However, this approach conflicted with the implicit rowid mechanism that FTS5 uses to manage its internal indexing. The triggers were designed to delete and insert rows in the FTS5 table based on the id column, but this led to the FTS5 table containing multiple entries for the same page, as the id column did not align with the rowid mechanism.

Another issue was the use of the content and content_rowid options, which enforce a strict synchronization between the FTS5 table and the external content table. The user’s intention was to index only the most recent revisions of each page, but the external content indexing mechanism does not allow for such selective indexing. This led to the FTS5 table containing entries for all revisions, rather than just the most recent ones.

Troubleshooting Steps, Solutions & Fixes: Correcting FTS5 Index Management

To resolve the issues, the user needed to abandon the external content indexing approach and instead manage the FTS5 index manually. This involves creating the FTS5 table without the content and content_rowid options, allowing for more flexible index management. The triggers were then modified to ensure that only the most recent revision of each page is indexed in the FTS5 table.

The corrected approach involves the following steps:

  1. Reconfigure the FTS5 Table: The FTS5 table should be created without the content and content_rowid options. This allows for manual management of the index, rather than relying on the external content indexing mechanism. The FTS5 table should only include the columns that need to be indexed, such as title and body.

    CREATE VIRTUAL TABLE pages_fts USING FTS5(
      title,
      body
    );
    
  2. Modify the Triggers: The triggers need to be updated to ensure that only the most recent revision of each page is indexed in the FTS5 table. The BEFORE INSERT trigger should delete any existing entries for the same title in the FTS5 table, and the AFTER INSERT trigger should insert the new entry. This ensures that the FTS5 table always contains only the most recent revision of each page.

    CREATE TRIGGER pages_after_insert AFTER INSERT ON pages
    BEGIN
      DELETE FROM pages_fts WHERE title = new.title;
      INSERT INTO pages_fts (title, body)
      VALUES (new.title, new.body);
    END;
    
  3. Handle Deletions: If a page is deleted from the pages table, the corresponding entry in the FTS5 table should also be deleted. This can be achieved with an AFTER DELETE trigger.

    CREATE TRIGGER pages_after_delete AFTER DELETE ON pages
    BEGIN
      DELETE FROM pages_fts WHERE title = old.title;
    END;
    
  4. Integrity Checks: Since the FTS5 table is now managed manually, the integrity check will no longer fail due to inconsistencies between the pages table and the pages_fts table. However, it is still important to ensure that the triggers are functioning correctly and that the FTS5 table is being updated as expected.

  5. Search Queries: With the corrected setup, search queries will now return only the most recent revision of each page. The SELECT statements used to search the FTS5 table will return the expected results, with only one entry per page title.

    -- title search
    SELECT title, snippet(pages_fts, 1, '>', '<', '...', 10)
    AS snippet
    FROM pages_fts
    WHERE pages_fts
    MATCH 'title:home'
    ORDER BY rank
    LIMIT 50;
    
    -- body search
    SELECT title, snippet(pages_fts, 2, '>', '<', '...', 64)
    AS snippet
    FROM pages_fts
    WHERE pages_fts
    MATCH 'body:version'
    ORDER BY rank
    LIMIT 50;
    
  6. Linking to the Pages Table: If there is a need to link the FTS5 table back to the pages table, such as retrieving the id of the corresponding row in the pages table, this can be achieved by including the id column in the FTS5 table. However, this requires careful management to ensure that the id column is correctly synchronized with the pages table.

    CREATE VIRTUAL TABLE pages_fts USING FTS5(
      id UNINDEXED,
      title,
      body
    );
    
    CREATE TRIGGER pages_after_insert AFTER INSERT ON pages
    BEGIN
      DELETE FROM pages_fts WHERE title = new.title;
      INSERT INTO pages_fts (id, title, body)
      VALUES (new.id, new.title, new.body);
    END;
    

    In this setup, the id column is included in the FTS5 table but marked as UNINDEXED, meaning it will not be used for full-text search but can be used to retrieve the corresponding row from the pages table.

By following these steps, the user can achieve the desired functionality of indexing only the most recent revisions of each page in the FTS5 table, while maintaining the integrity of the database and ensuring that search queries return the expected results. This approach avoids the pitfalls of external content indexing and allows for more flexible and controlled management of the FTS5 index.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *