Optimizing SQLite Queries: Replacing NOT IN with EXCEPT for Performance Gains
Understanding the Performance Discrepancy Between NOT IN and EXCEPT Queries
The core issue revolves around the performance discrepancy observed when using NOT IN (<subquery>)
versus EXCEPT
in SQLite queries. Specifically, the discussion highlights a scenario where a query involving NOT IN
takes significantly longer to execute compared to an equivalent query rewritten using EXCEPT
. This performance gap is particularly evident when dealing with large datasets, where the subquery returns hundreds of thousands of rows, and the outer query processes a smaller set of rows.
The query in question involves a temporary table ImagesOnDisk
, which stores metadata about PNG images stored on disk. The goal is to identify and delete outdated blobs by comparing the UUID and boolean values in ImagesOnDisk
against a larger dataset stored in the Printing
and CardFace
tables. The NOT IN
query scans the ImagesOnDisk
table and performs a subquery to filter out rows that exist in the Printing
and CardFace
tables. The EXCEPT
query, on the other hand, uses a compound query to achieve the same result but does so more efficiently.
The query plans reveal that the NOT IN
approach involves a full scan of the ImagesOnDisk
table and a subquery that scans the Printing
table and searches the CardFace
table. In contrast, the EXCEPT
query uses a compound query with a temporary B-tree to store intermediate results, which significantly reduces the execution time. This discrepancy suggests that the SQLite query planner may not be optimizing NOT IN
queries as effectively as it could, especially when dealing with large datasets.
Exploring the Root Causes of the Performance Gap
Several factors contribute to the performance gap between NOT IN
and EXCEPT
queries in SQLite. One of the primary reasons is the way SQLite handles subqueries in NOT IN
clauses. When a NOT IN
clause is used, SQLite must evaluate the subquery for each row in the outer query, which can lead to inefficiencies, especially when the subquery returns a large number of rows. This is because SQLite does not have a built-in mechanism to optimize the evaluation of NOT IN
subqueries in the same way it does for other types of queries.
Another factor is the use of temporary structures in the EXCEPT
query. The EXCEPT
operator inherently creates a temporary result set that contains the difference between the two input sets. This temporary result set can be efficiently stored in a B-tree, which allows for faster lookups and comparisons. In contrast, the NOT IN
query does not create such a temporary structure, leading to repeated evaluations of the subquery for each row in the outer query.
Additionally, the query planner’s cost estimation may not accurately reflect the true cost of executing NOT IN
queries. The query planner estimates the cost of each query plan based on factors such as the number of rows, the presence of indexes, and the complexity of the query. However, these estimates may not fully account for the overhead associated with repeatedly evaluating a subquery in a NOT IN
clause. As a result, the query planner may choose a less optimal plan for NOT IN
queries, leading to slower execution times.
The discussion also highlights the importance of indexing in query performance. While the NOT IN
query uses existing indexes on the Printing
and CardFace
tables, these indexes may not be sufficient to offset the overhead of repeatedly evaluating the subquery. In contrast, the EXCEPT
query benefits from the temporary B-tree structure, which effectively acts as an index for the intermediate result set. This allows the EXCEPT
query to perform lookups and comparisons more efficiently, even without additional indexes on the underlying tables.
Step-by-Step Troubleshooting and Optimization Strategies
To address the performance gap between NOT IN
and EXCEPT
queries, several troubleshooting and optimization strategies can be employed. The first step is to analyze the query plans for both the NOT IN
and EXCEPT
queries to identify any inefficiencies in the way SQLite is executing the queries. This can be done using the EXPLAIN QUERY PLAN
statement, which provides detailed information about the steps SQLite takes to execute a query.
Once the query plans have been analyzed, the next step is to consider rewriting the NOT IN
query using the EXCEPT
operator. As demonstrated in the discussion, the EXCEPT
query consistently performs better than the NOT IN
query, especially when dealing with large datasets. The EXCEPT
query can be further optimized by ensuring that the temporary result set is stored in a B-tree, which allows for faster lookups and comparisons.
Another optimization strategy is to use Common Table Expressions (CTEs) to simplify the query and improve performance. The discussion suggests using a CTE to store the result of the EXCEPT
operation and then joining this result with the original table to retrieve the desired columns. This approach can help reduce the complexity of the query and improve readability, while also potentially improving performance.
For example, the following query uses a CTE to store the result of the EXCEPT
operation and then joins this result with the ImagesOnDisk
table to retrieve the desired columns:
WITH todo(scryfall_id, is_front) AS (
SELECT scryfall_id, is_front
FROM ImagesOnDisk
EXCEPT
SELECT scryfall_id, is_front
FROM Printing
JOIN CardFace USING (printing_id)
)
SELECT scryfall_id, is_front, highres_on_disk, absolute_path
FROM ImagesOnDisk
JOIN todo USING (scryfall_id, is_front);
This query first creates a temporary result set todo
that contains the difference between the ImagesOnDisk
table and the Printing
and CardFace
tables. It then joins this result set with the ImagesOnDisk
table to retrieve the desired columns. This approach can help improve performance by reducing the number of times the subquery needs to be evaluated and by leveraging the temporary B-tree structure created by the EXCEPT
operator.
In addition to rewriting the query, it is also important to ensure that the underlying tables are properly indexed. While the EXCEPT
query may not require additional indexes to perform well, having appropriate indexes on the Printing
and CardFace
tables can help improve the performance of the subquery. For example, an index on the printing_id
column in the CardFace
table can help speed up the join operation between the Printing
and CardFace
tables.
Finally, it is important to consider the overall design of the database and the specific requirements of the application. In some cases, it may be more efficient to denormalize the data or to use a different database schema that better supports the types of queries being performed. For example, if the ImagesOnDisk
table is frequently queried for outdated blobs, it may be beneficial to store additional metadata in the table or to use a different data structure that allows for faster lookups and comparisons.
In conclusion, the performance gap between NOT IN
and EXCEPT
queries in SQLite can be addressed through a combination of query rewriting, indexing, and database design optimization. By analyzing the query plans, rewriting the query using the EXCEPT
operator, and ensuring that the underlying tables are properly indexed, it is possible to significantly improve the performance of queries that involve large datasets. Additionally, using CTEs and considering the overall design of the database can help further optimize query performance and ensure that the application meets its performance requirements.