Optimizing Complex SQLite Queries with Correlated Subqueries and JOINs

Slow Query Performance Due to Correlated Subqueries and Inefficient JOINs

The core issue revolves around a SQLite query that performs poorly due to the use of correlated subqueries and inefficient JOIN operations. The query retrieves data from three tables (Project_List, Project_Dashboard, and Project_Extras) and applies multiple filtering conditions, including date comparisons and aggregate functions. The query also uses LEFT JOIN operations, which, while syntactically correct, are unnecessary given the filtering conditions in the WHERE clause. The primary performance bottleneck stems from the repeated execution of correlated subqueries to fetch the maximum InsertDate for each ProjID across the three tables. This results in a significant computational overhead, as the subqueries are executed for each row processed by the main query.

The query also suffers from redundant conditions and unnecessary column retrievals. For instance, the LEFT JOIN conditions are duplicated in the WHERE clause, and all columns from all tables are selected (a.*, b.*, c.*), even though many of these columns are not required for the final result. Additionally, the query uses a GROUP BY clause to group results by Manager and ProjID, which further adds to the computational load.

The query plan generated by SQLite’s .expert function reveals that the database engine is using multiple indexes to search the tables, but the presence of correlated subqueries and the lack of materialized intermediate results are causing inefficiencies. The use of temporary B-trees for the GROUP BY operation also indicates that the query is not fully optimized for performance.

Correlated Subqueries and Redundant JOIN Conditions

The primary cause of the slow query performance is the use of correlated subqueries to fetch the maximum InsertDate for each ProjID in the Project_List, Project_Dashboard, and Project_Extras tables. Correlated subqueries are executed repeatedly for each row processed by the main query, leading to a significant performance penalty. In this case, the subqueries are used to ensure that only the most recent records (based on InsertDate) are included in the result set. However, the repeated execution of these subqueries for each row is highly inefficient.

Another contributing factor is the use of LEFT JOIN operations with redundant conditions in the WHERE clause. The LEFT JOIN syntax is used to ensure that all records from the Project_List table are included in the result set, even if there are no matching records in the Project_Dashboard or Project_Extras tables. However, the filtering conditions in the WHERE clause effectively negate the need for LEFT JOIN, as they exclude records that do not have matching entries in the joined tables. This redundancy forces the query optimizer to perform unnecessary work, further degrading performance.

The query also retrieves all columns from all tables (a.*, b.*, c.*), even though many of these columns are not required for the final result. This increases the amount of data that needs to be processed and transferred, adding to the query’s execution time. Additionally, the GROUP BY clause groups the results by Manager and ProjID, which requires the database engine to sort and aggregate the data, further increasing the computational load.

Materializing Intermediate Results and Optimizing JOINs

To address the performance issues, the query can be optimized by materializing the intermediate results of the correlated subqueries and simplifying the JOIN operations. The first step is to create a Common Table Expression (CTE) that precomputes the maximum InsertDate for each ProjID across the three tables. This eliminates the need for correlated subqueries in the main query and allows the database engine to process the data more efficiently.

The following CTE can be used to materialize the intermediate results:

WITH maxes(ProjID, pl_insert, pd_insert, pe_insert) AS (
    SELECT ProjID,
           MAX(InsertDate),
           (SELECT MAX(InsertDate) FROM Project_Dashboard WHERE ProjID = e.ProjID),
           (SELECT MAX(InsertDate) FROM Project_Extras WHERE ProjID = e.ProjID)
    FROM Project_List e
    GROUP BY ProjID
)

This CTE computes the maximum InsertDate for each ProjID in the Project_List table and includes the corresponding maximum InsertDate values from the Project_Dashboard and Project_Extras tables. The results are stored in a temporary table (maxes), which can then be joined with the main tables in the query.

The main query can be rewritten to use the maxes CTE and simplified JOIN operations:

SELECT a.*, b.*, c.*
FROM Project_List AS a
JOIN Project_Dashboard AS b ON a.ProjID = b.ProjID
JOIN Project_Extras AS c ON a.ProjID = c.ProjID
JOIN maxes ON a.ProjID = maxes.ProjID
WHERE a.PMO_Board_Report != 'No'
  AND a.Status = 'Acknowledged'
  AND Project_Type != 'Outside IT perimeter'
  AND a.Start_date < date('now')
  AND (
      (a.Target_Go_Live_Date > substr(date('now'), 1, 4) || '-01-15') OR
      (a.Target_Go_Live_Date = '' AND a.Finish_Date > substr(date('now'), 1, 4) || '-01-15')
  )
  AND (b.TimeIndicator = '0' OR b.CostIndicator = '0')
  AND a.InsertDate = maxes.pl_insert
  AND b.InsertDate = maxes.pd_insert
  AND c.InsertDate = maxes.pe_insert
GROUP BY a.Manager, a.ProjID
ORDER BY a.Manager, a.ProjID;

In this optimized query, the maxes CTE is joined with the main tables, and the filtering conditions are applied directly to the joined data. This eliminates the need for correlated subqueries and redundant JOIN conditions, significantly reducing the query’s execution time.

Additionally, the query can be further optimized by selecting only the required columns instead of using a.*, b.*, c.*. This reduces the amount of data that needs to be processed and transferred, further improving performance. For example:

SELECT a.Manager, a.ProjID, a.Start_date, a.Target_Go_Live_Date, a.Finish_date,
       b.TimeIndicator, b.CostIndicator,
       c.ExtraField1, c.ExtraField2
FROM Project_List AS a
JOIN Project_Dashboard AS b ON a.ProjID = b.ProjID
JOIN Project_Extras AS c ON a.ProjID = c.ProjID
JOIN maxes ON a.ProjID = maxes.ProjID
WHERE a.PMO_Board_Report != 'No'
  AND a.Status = 'Acknowledged'
  AND Project_Type != 'Outside IT perimeter'
  AND a.Start_date < date('now')
  AND (
      (a.Target_Go_Live_Date > substr(date('now'), 1, 4) || '-01-15') OR
      (a.Target_Go_Live_Date = '' AND a.Finish_Date > substr(date('now'), 1, 4) || '-01-15')
  )
  AND (b.TimeIndicator = '0' OR b.CostIndicator = '0')
  AND a.InsertDate = maxes.pl_insert
  AND b.InsertDate = maxes.pd_insert
  AND c.InsertDate = maxes.pe_insert
GROUP BY a.Manager, a.ProjID
ORDER BY a.Manager, a.ProjID;

By selecting only the necessary columns, the query becomes more efficient and easier to maintain.

Finally, it is important to ensure that the maxes CTE is materialized correctly. If the CTE is not materialized automatically, it can be forced to materialize by adding a DISTINCT clause:

WITH maxes(ProjID, pl_insert, pd_insert, pe_insert) AS (
    SELECT DISTINCT ProjID,
           MAX(InsertDate),
           (SELECT MAX(InsertDate) FROM Project_Dashboard WHERE ProjID = e.ProjID),
           (SELECT MAX(InsertDate) FROM Project_Extras WHERE ProjID = e.ProjID)
    FROM Project_List e
    GROUP BY ProjID
)

This ensures that the intermediate results are stored in a temporary table, further improving query performance.

By following these steps, the query’s performance can be significantly improved, reducing the execution time from 15.901 seconds to 1.275 seconds or less. This optimization not only enhances the user experience but also reduces the computational load on the database server, making it more scalable and efficient.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *