Optimizing SQLite Query Performance: Avoiding Slow Index Scans with Nested Searches

Understanding the Query Plan and Performance Bottleneck

The core issue revolves around SQLite’s query planner choosing a suboptimal execution plan for a query that involves a join with a view (tag_min) and a range filter on the primary key (foo.id BETWEEN 1000 AND 2000). The query is designed to fetch a small subset of rows from a large table (foo), but the planner insists on performing a full index scan on the foo_tag index, which results in poor performance. The desired behavior is for SQLite to use a nested loop search strategy, where the tag_min subquery is evaluated for each row in the range, rather than scanning the entire index.

The schema includes a table foo with columns id (primary key), tag, and val, along with two indexes: foo_tag on tag and foo_val on val. The tag_min view aggregates the minimum val for each tag. The query in question joins foo with tag_min and filters rows where foo.id is within a specific range and tag_min.val equals 1. The query is expected to return only a few rows, but the planner’s choice of a full index scan makes it inefficient.

Why SQLite Chooses a Slow Index Scan Over a Nested Search

The primary reason SQLite opts for a full index scan instead of a nested search lies in how the query planner estimates costs and evaluates potential execution plans. SQLite’s planner relies on statistics and heuristics to determine the most efficient way to execute a query. In this case, the planner believes that scanning the foo_tag index and materializing the tag_min view is the most cost-effective approach. However, this decision is based on incomplete or inaccurate assumptions about the distribution of data and the selectivity of the filters.

One key factor is the lack of correlation between the foo.id range filter and the tag_min.val filter in the planner’s cost model. The planner does not recognize that the combination of these filters drastically reduces the number of rows that need to be examined. Instead, it treats the tag_min view as a separate entity that must be fully materialized before the join can proceed. This results in unnecessary work, as the majority of the rows scanned in the foo_tag index are irrelevant to the final result.

Another contributing factor is the structure of the indexes. The foo_tag index only includes the tag column, which means that SQLite must perform additional lookups to retrieve the val column for the MIN(val) aggregation. This increases the cost of the index scan and makes the planner less likely to choose a nested search strategy. If the index included both tag and val (e.g., CREATE INDEX foo_tag_val ON foo(tag, val)), the planner might have more flexibility in choosing an efficient execution plan.

Strategies to Improve Query Performance Without Correlated Subqueries

To address the performance issue without resorting to correlated subqueries, several strategies can be employed. Each approach aims to guide the query planner toward a more efficient execution plan while preserving the use of the tag_min view and allowing access to additional columns in the view.

1. Restructuring the Indexes

One effective way to improve performance is to create a composite index that includes both tag and val. This allows SQLite to perform the MIN(val) aggregation directly from the index, reducing the need for additional lookups. For example:

CREATE INDEX foo_tag_val ON foo(tag, val);

With this index, the planner can efficiently retrieve the minimum val for each tag without scanning the entire table. This change alone can significantly reduce the number of byte-code instructions required to execute the query.

2. Using Common Table Expressions (CTEs) with NOT MATERIALIZED

Another approach is to use a Common Table Expression (CTE) with the NOT MATERIALIZED hint. This prevents SQLite from materializing the CTE and instead evaluates it as part of the main query. For example:

WITH tag_min(tag, val) AS NOT MATERIALIZED
    (SELECT tag, MIN(val) FROM foo GROUP BY tag)
SELECT foo.id FROM foo
JOIN tag_min ON tag_min.tag = foo.tag
WHERE foo.id BETWEEN 1000 AND 2000
 AND tag_min.val = 1
ORDER BY foo.id LIMIT 5;

This approach encourages the planner to evaluate the tag_min subquery for each row in the range, resulting in a more efficient nested loop search.

3. Pushing Filters into the View

Manually pushing the WHERE condition into the view can also improve performance. This reduces the number of rows that need to be processed by the tag_min view. For example:

WITH tag_min(tag, val) AS
    (SELECT tag, MIN(val) FROM foo
    WHERE tag IN (SELECT tag FROM foo WHERE id BETWEEN 1000 AND 2000)
    GROUP BY tag)
SELECT foo.id FROM foo
JOIN tag_min ON tag_min.tag = foo.tag
WHERE foo.id BETWEEN 1000 AND 2000
 AND tag_min.val = 1
ORDER BY foo.id LIMIT 5;

This approach replaces the full index scan with a partial scan, which is more efficient but still not as fast as the correlated subquery.

4. Leveraging Window Functions

Window functions can be used to express the query without joins, but they often result in poor performance due to the way SQLite handles them. For example:

SELECT id
FROM (SELECT id, MIN(val) OVER (PARTITION BY tag) AS min_val FROM foo)
WHERE id BETWEEN 1000 AND 2000 AND min_val = 1
ORDER BY id
LIMIT 5;

While this approach avoids joins, it typically performs worse than other methods because it requires scanning the entire table.

5. Combining Indexes and CTEs

A hybrid approach that combines composite indexes and CTEs can yield the best results. For example:

CREATE INDEX foo_tag_val ON foo(tag, val);

WITH tag_min(tag, val) AS NOT MATERIALIZED
    (SELECT tag, MIN(val) FROM foo GROUP BY tag)
SELECT foo.id FROM foo
JOIN tag_min ON tag_min.tag = foo.tag
WHERE foo.id BETWEEN 1000 AND 2000
 AND tag_min.val = 1
ORDER BY foo.id LIMIT 5;

This approach leverages the efficiency of the composite index while using the NOT MATERIALIZED hint to avoid unnecessary materialization.

Conclusion

The key to optimizing this query lies in understanding how SQLite’s query planner makes decisions and guiding it toward a more efficient execution plan. By restructuring indexes, using CTEs with NOT MATERIALIZED, pushing filters into views, and combining these techniques, it is possible to achieve significant performance improvements without resorting to correlated subqueries. Each approach has its trade-offs, and the best solution depends on the specific requirements of the application, such as the need to use the tag_min view or access additional columns. With careful tuning and experimentation, it is possible to strike the right balance between performance and maintainability.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *