FTS5 Trigger Misuse and External Content Indexing Issues in SQLite
Issue Overview: FTS5 Trigger Misuse and External Content Indexing
The core issue revolves around the misuse of SQLite’s FTS5 (Full-Text Search) extension, specifically in the context of managing a wiki-like notebook application. The goal is to maintain a pages
table where each page revision is stored, and only the most recent revision of each page should be indexed in the FTS5 table for search purposes. The user attempted to achieve this by using triggers to manage the FTS5 index, but the implementation led to unexpected results, including the indexing of all revisions instead of just the most recent ones.
The user’s initial approach involved creating a pages
table to store page revisions, with each revision having a title, a timestamp, and a body. The FTS5 table, pages_fts
, was intended to index only the most recent revision of each page. Triggers were set up to manage the FTS5 index: a BEFORE INSERT
trigger to delete the old entry from the FTS5 table, an AFTER INSERT
trigger to insert the new entry, and an AFTER DELETE
trigger to remove entries from the FTS5 table when a page is deleted.
However, the implementation did not work as expected. The FTS5 table ended up indexing all revisions of the pages, leading to multiple search results for the same page title. The user also encountered issues with the integrity check, which failed due to inconsistencies between the pages
table and the pages_fts
table.
Possible Causes: Misunderstanding FTS5 External Content Indexing
The primary cause of the issue lies in the misunderstanding of how FTS5’s external content indexing works. When using the content
and content_rowid
options in FTS5, the FTS5 table is tightly coupled with the external content table (pages
in this case). This coupling means that the FTS5 table must always reflect the exact state of the external content table. Any attempt to selectively index rows from the external content table will result in inconsistencies and potential database corruption.
In the user’s initial setup, the FTS5 table was configured with an explicit id
column, which was intended to map to the id
column in the pages
table. However, this approach conflicted with the implicit rowid
mechanism that FTS5 uses to manage its internal indexing. The triggers were designed to delete and insert rows in the FTS5 table based on the id
column, but this led to the FTS5 table containing multiple entries for the same page, as the id
column did not align with the rowid
mechanism.
Another issue was the use of the content
and content_rowid
options, which enforce a strict synchronization between the FTS5 table and the external content table. The user’s intention was to index only the most recent revisions of each page, but the external content indexing mechanism does not allow for such selective indexing. This led to the FTS5 table containing entries for all revisions, rather than just the most recent ones.
Troubleshooting Steps, Solutions & Fixes: Correcting FTS5 Index Management
To resolve the issues, the user needed to abandon the external content indexing approach and instead manage the FTS5 index manually. This involves creating the FTS5 table without the content
and content_rowid
options, allowing for more flexible index management. The triggers were then modified to ensure that only the most recent revision of each page is indexed in the FTS5 table.
The corrected approach involves the following steps:
Reconfigure the FTS5 Table: The FTS5 table should be created without the
content
andcontent_rowid
options. This allows for manual management of the index, rather than relying on the external content indexing mechanism. The FTS5 table should only include the columns that need to be indexed, such astitle
andbody
.CREATE VIRTUAL TABLE pages_fts USING FTS5( title, body );
Modify the Triggers: The triggers need to be updated to ensure that only the most recent revision of each page is indexed in the FTS5 table. The
BEFORE INSERT
trigger should delete any existing entries for the same title in the FTS5 table, and theAFTER INSERT
trigger should insert the new entry. This ensures that the FTS5 table always contains only the most recent revision of each page.CREATE TRIGGER pages_after_insert AFTER INSERT ON pages BEGIN DELETE FROM pages_fts WHERE title = new.title; INSERT INTO pages_fts (title, body) VALUES (new.title, new.body); END;
Handle Deletions: If a page is deleted from the
pages
table, the corresponding entry in the FTS5 table should also be deleted. This can be achieved with anAFTER DELETE
trigger.CREATE TRIGGER pages_after_delete AFTER DELETE ON pages BEGIN DELETE FROM pages_fts WHERE title = old.title; END;
Integrity Checks: Since the FTS5 table is now managed manually, the integrity check will no longer fail due to inconsistencies between the
pages
table and thepages_fts
table. However, it is still important to ensure that the triggers are functioning correctly and that the FTS5 table is being updated as expected.Search Queries: With the corrected setup, search queries will now return only the most recent revision of each page. The
SELECT
statements used to search the FTS5 table will return the expected results, with only one entry per page title.-- title search SELECT title, snippet(pages_fts, 1, '>', '<', '...', 10) AS snippet FROM pages_fts WHERE pages_fts MATCH 'title:home' ORDER BY rank LIMIT 50; -- body search SELECT title, snippet(pages_fts, 2, '>', '<', '...', 64) AS snippet FROM pages_fts WHERE pages_fts MATCH 'body:version' ORDER BY rank LIMIT 50;
Linking to the Pages Table: If there is a need to link the FTS5 table back to the
pages
table, such as retrieving theid
of the corresponding row in thepages
table, this can be achieved by including theid
column in the FTS5 table. However, this requires careful management to ensure that theid
column is correctly synchronized with thepages
table.CREATE VIRTUAL TABLE pages_fts USING FTS5( id UNINDEXED, title, body ); CREATE TRIGGER pages_after_insert AFTER INSERT ON pages BEGIN DELETE FROM pages_fts WHERE title = new.title; INSERT INTO pages_fts (id, title, body) VALUES (new.id, new.title, new.body); END;
In this setup, the
id
column is included in the FTS5 table but marked asUNINDEXED
, meaning it will not be used for full-text search but can be used to retrieve the corresponding row from thepages
table.
By following these steps, the user can achieve the desired functionality of indexing only the most recent revisions of each page in the FTS5 table, while maintaining the integrity of the database and ensuring that search queries return the expected results. This approach avoids the pitfalls of external content indexing and allows for more flexible and controlled management of the FTS5 index.