Optimizing SQLite Query Performance with FTS and Deduplication

Issue Overview: Slow Query Performance with FTS and Deduplication

The core issue is optimizing a query that combines Full-Text Search (FTS) with a deduplication requirement in SQLite. The figureCitations table contains over 2 million rows, many of which share the same httpUri value. The goal is to query this table efficiently using FTS on the captionText column while returning each distinct httpUri at most once. The current implementation creates a view (images) that filters and groups the data, but performance degrades sharply when this view is joined with the FTS index.

The figureCitations table schema includes several indexes, and an FTS virtual table (figureCitationsFts) is used for fast text searches. While the FTS query alone is fast (5 ms), the combined query that joins the view with the FTS table takes over 21 seconds. The challenge is to optimize this query without resorting to duplicating data or creating additional tables that would require maintenance.
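The original DDL is not reproduced in the discussion, so the following is a minimal sketch of the assumed setup; the column names (id, httpUri, captionText), the use of FTS5, and the exact view definition are inferred from the description above and may differ from the real schema.

-- Assumed base table (the real table has additional columns and indexes)
CREATE TABLE figureCitations (
  id INTEGER PRIMARY KEY,
  httpUri TEXT,
  captionText TEXT
);

-- Assumed FTS index over captionText, populated so its rowid matches figureCitations.id
CREATE VIRTUAL TABLE figureCitationsFts USING fts5 (captionText);

-- The deduplicating view described above
CREATE VIEW images AS
SELECT *
FROM figureCitations
WHERE httpUri != ''
AND httpUri NOT LIKE 'http://dx.doi.org%'
AND captionText != ''
GROUP BY httpUri;

-- Fast on its own (about 5 ms)
SELECT rowid FROM figureCitationsFts WHERE figureCitationsFts MATCH 'phylogeny';

-- Slow when combined with the view (over 21 seconds on ~2 million rows)
SELECT i.*
FROM images i
JOIN figureCitationsFts f ON i.id = f.rowid
WHERE f.captionText MATCH 'phylogeny';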

Possible Causes: Inefficient View Usage and Lack of Index Optimization

The primary cause of the slow query is the images view itself. Views in SQLite cannot be indexed, so every query against the view must re-evaluate its filtering and grouping logic. Worse, the GROUP BY prevents SQLite's query flattener from pushing the MATCH condition down into the view, so when the view is joined with the FTS table the entire grouped result set is computed before the FTS match is applied.

Additionally, the GROUP BY clause in the view introduces ambiguity in row selection, as SQLite does not guarantee which row will be selected when multiple rows have the same httpUri. This can lead to inconsistent results and further complicates the query optimization process.
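For example, when several rows share the same httpUri, any bare column in the SELECT list (such as captionText) is taken from an arbitrary row of each group:

-- captionText comes from an unspecified row within each httpUri group
SELECT httpUri, captionText, COUNT(*) AS copies
FROM figureCitations
GROUP BY httpUri;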

Another contributing factor is the lack of an index that serves both the filtering conditions (httpUri != '', httpUri NOT LIKE 'http://dx.doi.org%', and captionText != '') and the deduplication requirement. Partial indexes and generated columns have been suggested, but they only address the filtering side of the problem; they do nothing to select a single row per httpUri.
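For illustration, a partial index along these lines (an assumed example, not taken verbatim from the original discussion) narrows the scan to qualifying rows but still cannot pick one row per httpUri:

CREATE INDEX ix_figureCitations_uri ON figureCitations (httpUri)
WHERE httpUri != '' AND httpUri NOT LIKE 'http://dx.doi.org%' AND captionText != '';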

Troubleshooting Steps, Solutions & Fixes: Normalization, Indexing, and Query Refactoring

To address the performance issues, a multi-faceted approach is required, involving data normalization, indexing strategies, and query refactoring. Below are the detailed steps and solutions to optimize the query:

1. Normalize the Data by Separating httpUri into a Dedicated Table

The first step is to normalize the data by separating the httpUri values into a dedicated table. This approach reduces redundancy and simplifies the deduplication process. The new table, uniqueUris, will store each unique httpUri along with a unique identifier. The figureCitations table will then reference this identifier instead of storing the httpUri directly.

CREATE TABLE uniqueUris (
  uriId INTEGER PRIMARY KEY,
  httpUri TEXT UNIQUE
);

-- Populate the uniqueUris table with distinct httpUri values
INSERT INTO uniqueUris (httpUri)
SELECT DISTINCT httpUri
FROM figureCitations
WHERE httpUri != '' AND httpUri NOT LIKE 'http://dx.doi.org%';

-- Add a reference to uniqueUris in figureCitations
ALTER TABLE figureCitations ADD COLUMN uriId INTEGER REFERENCES uniqueUris(uriId);

-- Update the figureCitations table to set the uriId based on uniqueUris
UPDATE figureCitations
SET uriId = (
  SELECT uriId FROM uniqueUris WHERE uniqueUris.httpUri = figureCitations.httpUri
)
WHERE httpUri != '' AND httpUri NOT LIKE 'http://dx.doi.org%';

This normalization step ensures that each httpUri is stored only once, and the figureCitations table references these unique values. This reduces the complexity of deduplication and improves query performance.
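A quick sanity check after the migration: every qualifying row should now carry a uriId.

-- Expected result: 0
SELECT COUNT(*) AS missing
FROM figureCitations
WHERE httpUri != ''
AND httpUri NOT LIKE 'http://dx.doi.org%'
AND uriId IS NULL;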

2. Create a Partial Composite Index for the Filtered Rows

With the normalized data structure in place, the next step is to create a partial composite index whose WHERE clause mirrors the filtering conditions. Because the index contains only qualifying rows, it speeds up the filter and reduces the number of rows that have to be scanned during the query.

CREATE INDEX ix_figureCitations_filtered ON figureCitations (uriId, captionText)
WHERE httpUri != '' AND httpUri NOT LIKE 'http://dx.doi.org%' AND captionText != '';

This index stores only the rows that satisfy the filter, so far fewer entries need to be examined. Note that SQLite will only use a partial index when the query's WHERE clause implies the index's WHERE clause, so the same conditions must be repeated in any query that is meant to benefit from it.
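Whether the planner actually picks the index can be verified with EXPLAIN QUERY PLAN; a minimal check (the filter is repeated exactly as in the index definition):

EXPLAIN QUERY PLAN
SELECT uriId
FROM figureCitations
WHERE httpUri != ''
AND httpUri NOT LIKE 'http://dx.doi.org%'
AND captionText != '';
-- Look for ix_figureCitations_filtered in the output; the exact wording
-- (SEARCH vs SCAN, COVERING INDEX or not) varies between SQLite versions.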

3. Refactor the Query to Leverage Normalized Data and Indexes

With the normalized data and the partial index in place, the query can be refactored to take advantage of them. The new query joins figureCitations with uniqueUris and the FTS index, repeats the filter conditions so the partial index remains usable, and groups on uriId so that each distinct httpUri is returned at most once, mirroring the behavior of the old images view.

SELECT fc.*
FROM figureCitations fc
JOIN uniqueUris uu ON fc.uriId = uu.uriId
JOIN figureCitationsFts ff ON fc.id = ff.rowid
WHERE ff.captionText MATCH 'phylogeny'
AND fc.httpUri != ''
AND fc.httpUri NOT LIKE 'http://dx.doi.org%'
AND fc.captionText != ''
GROUP BY fc.uriId;

This query leverages the normalized structure and the partial index, and the GROUP BY on uriId reproduces the deduplication that the images view previously provided, resulting in significantly improved performance.
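It may also be worth running ANALYZE so the planner has statistics to work with, and then inspecting the plan of the refactored query; a sketch:

ANALYZE;

EXPLAIN QUERY PLAN
SELECT fc.*
FROM figureCitations fc
JOIN uniqueUris uu ON fc.uriId = uu.uriId
JOIN figureCitationsFts ff ON fc.id = ff.rowid
WHERE ff.captionText MATCH 'phylogeny'
AND fc.httpUri != ''
AND fc.httpUri NOT LIKE 'http://dx.doi.org%'
AND fc.captionText != ''
GROUP BY fc.uriId;
-- Ideally the MATCH term drives the plan (a scan of figureCitationsFts)
-- and figureCitations rows are then fetched by rowid; the exact output
-- depends on the SQLite version and the available statistics.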

4. Use Triggers to Maintain Data Consistency

To ensure that the normalized data structure remains consistent, triggers can be used to automatically update the uniqueUris table and the uriId references in figureCitations whenever new data is inserted or updated.

-- Trigger to insert new uniqueUris on figureCitations insert
CREATE TRIGGER tr_insert_figureCitations
AFTER INSERT ON figureCitations
FOR EACH ROW
WHEN NEW.httpUri != '' AND NEW.httpUri NOT LIKE 'http://dx.doi.org%'
BEGIN
  INSERT OR IGNORE INTO uniqueUris (httpUri) VALUES (NEW.httpUri);
  UPDATE figureCitations SET uriId = (SELECT uriId FROM uniqueUris WHERE httpUri = NEW.httpUri) WHERE id = NEW.id;
END;

-- Trigger to refresh uriId when httpUri changes on an existing row
CREATE TRIGGER tr_update_figureCitations
AFTER UPDATE OF httpUri ON figureCitations
FOR EACH ROW
WHEN NEW.httpUri != '' AND NEW.httpUri NOT LIKE 'http://dx.doi.org%'
BEGIN
  INSERT OR IGNORE INTO uniqueUris (httpUri) VALUES (NEW.httpUri);
  UPDATE figureCitations SET uriId = (SELECT uriId FROM uniqueUris WHERE httpUri = NEW.httpUri) WHERE id = NEW.id;
END;

Restricting the second trigger to updates of httpUri keeps it from re-firing when uriId itself is set and avoids unnecessary work on unrelated updates. Together, the two triggers keep the uniqueUris table and the uriId references up to date, maintaining data consistency without manual intervention.
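A quick way to exercise the triggers (the column list is illustrative and assumes the sketched schema):

INSERT INTO figureCitations (httpUri, captionText)
VALUES ('https://example.org/figures/1.png', 'Fig. 1. Phylogeny of the sampled taxa.');

-- The AFTER INSERT trigger should have created the uniqueUris row (if it was
-- new) and filled in uriId on the row just inserted.
SELECT id, httpUri, uriId
FROM figureCitations
WHERE id = last_insert_rowid();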

5. Optimize FTS Index Usage

Finally, consider using the content option in the FTS5 virtual table so that it reads captionText directly from the figureCitations table instead of storing a second copy of the text. This does not remove the join on rowid, but it keeps the database considerably smaller and ties the FTS rowids to figureCitations.id via content_rowid. (If figureCitationsFts already exists as an ordinary FTS table, it must be dropped and recreated with the new definition.)

CREATE VIRTUAL TABLE figureCitationsFts USING fts5 (
  captionText,
  content='figureCitations',
  content_rowid='id'
);

With this configuration the virtual table stores only the full-text index itself and fetches the original captionText from figureCitations when it is needed, so the overhead of the duplicated text disappears while the rowid join used in the queries above keeps working unchanged.
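Note that an external-content FTS table does not maintain itself: it has to be populated once and then kept in sync with figureCitations, for example with the trigger pattern shown in the FTS5 documentation for external content tables (the trigger names here are illustrative):

-- One-off (re)build of the index from the content table
INSERT INTO figureCitationsFts (figureCitationsFts) VALUES ('rebuild');

-- Keep the FTS index in sync with figureCitations
CREATE TRIGGER tr_fts_after_insert AFTER INSERT ON figureCitations
BEGIN
  INSERT INTO figureCitationsFts (rowid, captionText) VALUES (NEW.id, NEW.captionText);
END;

CREATE TRIGGER tr_fts_after_delete AFTER DELETE ON figureCitations
BEGIN
  INSERT INTO figureCitationsFts (figureCitationsFts, rowid, captionText)
  VALUES ('delete', OLD.id, OLD.captionText);
END;

CREATE TRIGGER tr_fts_after_update AFTER UPDATE ON figureCitations
BEGIN
  INSERT INTO figureCitationsFts (figureCitationsFts, rowid, captionText)
  VALUES ('delete', OLD.id, OLD.captionText);
  INSERT INTO figureCitationsFts (rowid, captionText) VALUES (NEW.id, NEW.captionText);
END;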

Conclusion

By normalizing the data, adding a partial composite index, refactoring the query, and using triggers to keep everything consistent, query performance can be improved dramatically. The approach does introduce one small lookup table (uniqueUris) and a handful of triggers, but those triggers maintain the structure automatically, so no manual upkeep is required. The result is a robust and scalable way to combine deduplication with FTS queries in SQLite, suitable for large datasets with complex filtering requirements.
