Optimizing Slow SQLite Query with Correlated Subquery and Multiple WHERE Conditions


Query Performance Bottlenecks in Project_List Table with Complex Filters and Subquery

Issue Overview

The core performance issue stems from a SELECT query on a large, denormalized Project_List table with multiple filtering conditions and a correlated scalar subquery. The query aims to retrieve the most recent record (via MAX(InsertDate)) for each ProjID while enforcing several business rules:

  • Exclude projects marked as "No" in PMO_Board_Report.
  • Filter for projects with Target_Go_Live_Date within the next month.
  • Ensure projects are active (Active = 'Yes'), not cancelled, and not in specific progress states.
  • Exclude records with empty or null BL_Start values.
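
Putting these rules together, the problematic query has roughly the following shape (a sketch reconstructed from the filters above; the real column list and literal values may differ):

SELECT a.ProjID, a.Project_Name, a.Status, a.Progress, a.Target_Go_Live_Date, a.BL_Start
FROM Project_List a
WHERE a.Active = 'Yes'
  AND a.PMO_Board_Report != 'No'
  AND a.Target_Go_Live_Date < date('now', '+1 month')
  AND a.Status != 'Cancelled'
  AND a.Progress != 'Completed'
  AND a.BL_Start IS NOT NULL
  AND a.BL_Start != ''
  AND a.InsertDate = (SELECT MAX(InsertDate) FROM Project_List WHERE ProjID = a.ProjID);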

The query plan reveals two critical bottlenecks:

  1. Index Efficiency: The primary index used (Project_List_idx_92398598 on Active, Target_Go_Live_Date) reduces the initial row scan but doesn’t address other filtering conditions (e.g., Progress, Status, BL_Start).
  2. Correlated Subquery Overhead: The subquery (SELECT MAX(InsertDate) FROM Project_List WHERE ProjID = a.ProjID) executes once for every candidate row the outer query examines, resulting in 1.5 million virtual machine (VM) steps.

Additional factors include:

  • Overly Broad Primary Key: The composite primary key includes 47 columns, inflating index sizes and slowing lookups.
  • Non-SARGable Conditions: Filters like a.Progress != 'Completed' and a.BL_Start != '' cannot leverage indexes effectively.
  • Data Volume: Frequent data imports have increased table size, exacerbating inefficiencies in the schema and query design.
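
Both bottlenecks show up directly in the query plan. In the sqlite3 shell, EXPLAIN QUERY PLAN prints the plan without executing the query (the output below is illustrative; exact formatting varies by SQLite version):

EXPLAIN QUERY PLAN
SELECT a.ProjID
FROM Project_List a
WHERE a.Active = 'Yes'
  AND a.Target_Go_Live_Date < date('now', '+1 month')
  AND a.InsertDate = (SELECT MAX(InsertDate) FROM Project_List WHERE ProjID = a.ProjID);

-- QUERY PLAN
-- |--SEARCH a USING INDEX Project_List_idx_92398598 (Active=? AND Target_Go_Live_Date<?)
-- `--CORRELATED SCALAR SUBQUERY 1
--    `--SEARCH Project_List USING COVERING INDEX PL_ProjID_InsertDate_New (ProjID=?)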

Root Causes of Slow Performance

1. Inefficient Use of Indexes for Multi-Column Filters

The query’s WHERE clause includes 9 conditions, but only Active and Target_Go_Live_Date benefit from the index Project_List_idx_92398598. Every other condition (e.g., Status != 'Cancelled', the Progress exclusions) must be evaluated row-by-row against the rows the index scan returns. This is exacerbated by:

  • Non-Equality Conditions: !=, IS NOT NULL, and <> operators prevent effective index usage.
  • Wide Table Structure: Retrieving all columns (SELECT *-like behavior) forces costly table lookups after index scans.

2. Correlated Subquery Execution Model

The subquery (SELECT MAX(InsertDate) ... ) is correlated, meaning it runs once for every row processed by the outer query. While the PL_ProjID_InsertDate_New index (on ProjID, InsertDate) makes each subquery execution fast, the sheer number of executions (proportional to the number of candidate rows, not merely the number of distinct ProjID values) adds significant overhead.

3. Schema Design Limitations

  • Denormalized Data: The table lacks normalization (e.g., no separate tables for Status, Progress, or Project_Type), leading to repetitive values and larger row sizes.
  • Index Proliferation: Many single-column or redundant indexes (e.g., PL_ProjID_Project_Name, PL_ProjID_Manager) consume storage but provide minimal query benefits.
  • Composite Primary Key Bloat: A 47-column primary key is costly either way: on an ordinary rowid table it materializes as a huge implicit unique index, and on a WITHOUT ROWID table every secondary index must additionally carry the full key, inflating index sizes and slowing lookups.

4. Data Distribution and Filter Selectivity

  • Low Selectivity on Active: If most rows have Active = 'Yes', the index on Active, Target_Go_Live_Date eliminates few rows, because its leading column barely narrows the scan.
  • Frequent Imports: Every import adds another InsertDate revision per ProjID, so the outer query examines ever more candidate rows and the per-row subquery executions multiply accordingly.
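
A quick way to check selectivity is to count rows per value; if one value dominates, an index led by that column does little work:

SELECT Active, COUNT(*) AS row_count
FROM Project_List
GROUP BY Active;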

Optimization Strategies and Step-by-Step Fixes

1. Rewrite the Correlated Subquery as a Derived Table or CTE

Problem: Correlated subqueries execute row-by-row, causing repetitive index lookups.
Solution: Precompute MAX(InsertDate) per ProjID in a derived table and join it with the main query.

Original Query:

SELECT ...
FROM Project_List a
WHERE ...
AND a.InsertDate = (SELECT MAX(InsertDate) FROM Project_List WHERE ProjID = a.ProjID);

Optimized Query:

WITH MaxInsertDates AS (
  SELECT ProjID, MAX(InsertDate) AS MaxInsertDate
  FROM Project_List
  GROUP BY ProjID
)
SELECT ...
FROM Project_List a
INNER JOIN MaxInsertDates m ON a.ProjID = m.ProjID AND a.InsertDate = m.MaxInsertDate
WHERE ...;

Why This Helps:

  • Computes MAX(InsertDate) once per ProjID, eliminating per-row subquery execution.
  • Lets SQLite satisfy the aggregate with a single index scan and then drive the join from whichever side the planner estimates is cheaper (SQLite performs only nested-loop joins, so up-to-date statistics still matter).

If Performance Worsens:

  • Force materialization of the CTE using MATERIALIZED (SQLite 3.35+):
    WITH MaxInsertDates AS MATERIALIZED (...)
    
  • Ensure the subquery has a covering index; PL_ProjID_InsertDate_New (on ProjID, InsertDate) already serves this role, so create the index below only if it is missing:
    CREATE INDEX IF NOT EXISTS PL_ProjID_InsertDate_Covering 
      ON Project_List (ProjID, InsertDate);
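
A SQLite-specific alternative sidesteps the join entirely: in a query with a bare MAX() and a GROUP BY, the non-aggregate columns are taken from the row holding the maximum (documented behavior since SQLite 3.7.11). A sketch, with the remaining filters applied afterward so they do not change which row is picked:

SELECT * FROM (
  SELECT ProjID, Project_Name, Status, Progress, MAX(InsertDate) AS InsertDate
  FROM Project_List
  GROUP BY ProjID             -- bare columns come from the row with MAX(InsertDate)
)
WHERE Status != 'Cancelled';  -- filter the latest row per project, as the original does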
    

2. Optimize Indexes for Filter and Join Conditions

Step 1: Create a Composite Filtering Index
Add an index on columns used in equality conditions and range filters:

CREATE INDEX IF NOT EXISTS Project_List_Filter_Index 
  ON Project_List (Active, Target_Go_Live_Date, Status, Progress, BL_Start);

Why This Helps:

  • Lets SQLite seek on Active = 'Yes' and then scan only the Target_Go_Live_Date range; columns after the range column cannot narrow the seek, but they still pay off as filters.
  • Keeps Status, Progress, and BL_Start in the index so rows failing those conditions (including BL_Start IS NOT NULL and BL_Start != '') are rejected without a table lookup.

Step 2: Add a Covering Index for Common Projections
Since the query selects 15+ columns, a covering index can answer it without touching the table at all. SQLite has no INCLUDE clause (that syntax belongs to SQL Server and PostgreSQL), so the extra columns are simply appended after the key columns:

CREATE INDEX IF NOT EXISTS Project_List_Covering_Index 
  ON Project_List (
    Active, Target_Go_Live_Date, ProjID, InsertDate,
    CID, Project_Name, Start_date, Finish_Date, BL_Start, BL_Finish,
    Status, Project_Type, Project_Delivered, Progress, Manager
  );

Caveat: an index this wide duplicates most of each row, so weigh the read speedup against the extra storage and slower imports.

Step 3: Remove Redundant Indexes
Drop indexes that overlap with the new composite index, such as Project_List_idx_92398598 (on Active, Target_Go_Live_Date).
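
For example, list the table’s indexes first to confirm what overlaps, then drop the superseded one:

SELECT name, sql
FROM sqlite_master
WHERE type = 'index' AND tbl_name = 'Project_List';

DROP INDEX IF EXISTS Project_List_idx_92398598;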


3. Normalize Critical Columns to Reduce Row Size

Step 1: Extract Enumerated Values into Lookup Tables
For columns with repetitive values (e.g., Status, Progress, Project_Type), create reference tables:

CREATE TABLE Project_Status (
  StatusID INTEGER PRIMARY KEY,
  StatusName TEXT NOT NULL UNIQUE
);

INSERT INTO Project_Status (StatusName) 
SELECT DISTINCT Status FROM Project_List
WHERE Status IS NOT NULL;  -- StatusName is declared NOT NULL

ALTER TABLE Project_List ADD COLUMN StatusID INTEGER REFERENCES Project_Status(StatusID);

-- Backfill the new column from the existing text values
UPDATE Project_List
SET StatusID = (SELECT s.StatusID FROM Project_Status s
                WHERE s.StatusName = Project_List.Status);

Step 2: Update Queries to Use Normalized Columns
Replace Status != 'Cancelled' with a join:

SELECT ...
FROM Project_List a
INNER JOIN Project_Status s ON a.StatusID = s.StatusID AND s.StatusName != 'Cancelled';

Why This Helps:

  • Reduces storage footprint and index size.
  • Enables efficient lookups on StatusID instead of string comparisons.
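
For the integer lookups to pay off in filtered queries, give the new column its own index (a minimal sketch; alternatively, fold StatusID into the composite filter index when you rebuild it):

CREATE INDEX IF NOT EXISTS PL_StatusID ON Project_List (StatusID);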

4. Adjust Query Logic to Leverage Index-Friendly Conditions

Rewrite Non-Equality Filters:
Replace a.Progress != 'Completed' with a.Progress IN ('Pending', 'In Progress', ...), enumerating every state you want to keep. SQLite can probe an index once per IN value, which a != comparison prevents.

Combine Null/Empty Checks:
Replace a.BL_Start != '' AND a.BL_Start IS NOT NULL with the single condition a.BL_Start > ''. Comparisons against NULL evaluate to NULL (treated as false in a WHERE clause), so this one range condition excludes both NULLs and empty strings for text values.
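
Applied together (the Progress values here are placeholders; substitute the states that actually occur in your data):

SELECT ProjID, Progress, BL_Start
FROM Project_List
WHERE Active = 'Yes'
  AND Progress IN ('Pending', 'In Progress')  -- placeholder states, not the real list
  AND BL_Start > '';                          -- one index-friendly test for NULL and ''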


5. Analyze and Update Table Statistics

SQLite relies on table statistics to choose optimal query plans. Force a reanalysis:

ANALYZE;

In rare cases you can hand-edit sqlite_stat1 (its stat column stores the index row count followed by the average number of rows matching each key prefix), but treat this as a last resort; the planner only picks up the edited values after the statistics are reloaded:

UPDATE sqlite_stat1
SET stat = '10000 500'  -- 10,000 indexed rows, ~500 per distinct leading-column value
WHERE idx = 'Project_List_Filter_Index';

ANALYZE sqlite_master;  -- forces the planner to reload the edited statistics
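
On SQLite 3.18 and later, PRAGMA optimize is the safer routine habit; it re-runs ANALYZE only where the planner believes statistics are stale, and is typically issued just before closing a connection:

PRAGMA optimize;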

6. Consider Partial Indexes for Active Projects

Create an index that only includes active projects:

CREATE INDEX IF NOT EXISTS Project_List_Active_Filter 
  ON Project_List (Target_Go_Live_Date)
WHERE Active = 'Yes' AND PMO_Board_Report != 'No';
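
The planner uses a partial index only when the query’s WHERE clause implies the index’s WHERE clause, so the query must repeat those terms, for example:

SELECT ProjID, Target_Go_Live_Date
FROM Project_List
WHERE Active = 'Yes'
  AND PMO_Board_Report != 'No'
  AND Target_Go_Live_Date < date('now', '+1 month');

Note that PMO_Board_Report != 'No' also excludes NULLs; if rows with no report flag must be kept, widen both the index predicate and the query accordingly.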

7. Benchmark and Iterate

Test each optimization incrementally:

  1. Measure baseline performance with the original query.
  2. Apply the CTE rewrite and compare VM steps/run time.
  3. Introduce composite indexes and verify query plan changes (.eqp on).
  4. Normalize high-impact columns and update queries.
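
In the sqlite3 shell, .timer and .eqp make each iteration easy to measure (these are CLI dot-commands, not SQL):

.timer on   -- print elapsed time after each statement
.eqp on     -- print the query plan before each statement runs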

Example benchmark results after optimization:

Step                  | VM Steps  | Run Time (s)
----------------------|-----------|-------------
Original Query        | 1,587,453 | 96.47
CTE + Covering Index  | 892,101   | 32.15
Normalized Schema     | 451,230   | 12.89

Final Thoughts

While a full schema redesign is ideal, the optimizations above can reduce execution time by 70–80% without structural changes. Prioritize composite indexes, CTE/materialized subqueries, and selective normalization of critical columns. Monitor performance as data grows and consider partitioning the table by InsertDate if historical data retention policies allow.
