Optimizing Slow SQLite Query with Correlated Subquery and Multiple WHERE Conditions
Query Performance Bottlenecks in Project_List Table with Complex Filters and Subquery
Issue Overview
The core performance issue stems from a SELECT query on a large, denormalized Project_List table with multiple filtering conditions and a correlated scalar subquery. The query aims to retrieve the most recent record (via MAX(InsertDate)) for each ProjID while enforcing several business rules:
- Exclude projects marked as "No" in
PMO_Board_Report. - Filter for projects with
Target_Go_Live_Datewithin the next month. - Ensure projects are active (
Active = 'Yes'), not cancelled, and not in specific progress states. - Exclude records with empty or null
BL_Startvalues.
The query plan reveals two critical bottlenecks:
- Index Efficiency: The primary index used (
Project_List_idx_92398598onActive, Target_Go_Live_Date) reduces the initial row scan but doesn’t address other filtering conditions (e.g.,Progress,Status,BL_Start). - Correlated Subquery Overhead: The subquery
(SELECT MAX(InsertDate) FROM Project_List WHERE ProjID = a.ProjID)executes once per row returned by the outer query, resulting in 1.5 million virtual machine (VM) steps.
Additional factors include:
- Overly Broad Primary Key: The composite primary key includes 47 columns, inflating index sizes and slowing lookups.
- Non-SARGable Conditions: Filters like
a.Progress != 'Completed'anda.BL_Start != ''cannot leverage indexes effectively. - Data Volume: Frequent data imports have increased table size, exacerbating inefficiencies in the schema and query design.
Root Causes of Slow Performance
1. Inefficient Use of Indexes for Multi-Column Filters
The query’s WHERE clause includes 9 conditions, but only Active and Target_Go_Live_Date benefit from the index Project_List_idx_92398598. Other conditions (e.g., Status != 'Cancelled', Progress exclusions) require full scans of the rows filtered by the initial index. This is exacerbated by:
- Non-Equality Conditions:
!=,IS NOT NULL, and<>operators prevent effective index usage. - Wide Table Structure: Retrieving all columns (
SELECT *-like behavior) forces costly table lookups after index scans.
2. Correlated Subquery Execution Model
The subquery (SELECT MAX(InsertDate) ... ) is correlated, meaning it runs once for every row processed by the outer query. While the PL_ProjID_InsertDate_New index (on ProjID, InsertDate) makes each subquery execution fast, the sheer number of executions (proportional to the number of valid ProjID groups) adds significant overhead.
3. Schema Design Limitations
- Denormalized Data: The table lacks normalization (e.g., no separate tables for
Status,Progress, orProject_Type), leading to repetitive values and larger row sizes. - Index Proliferation: Many single-column or redundant indexes (e.g.,
PL_ProjID_Project_Name,PL_ProjID_Manager) consume storage but provide minimal query benefits. - Composite Primary Key Bloat: The 47-column primary key forces all secondary indexes to include this key, increasing their size and reducing efficiency.
4. Data Distribution and Filter Selectivity
- Low Selectivity on
Active: If most rows haveActive = 'Yes', the index onActive, Target_Go_Live_Datefilters few rows. - Frequent
InsertDateUpdates: IfProjIDgroups have manyInsertDaterevisions, the subquery’sMAX(InsertDate)calculation becomes resource-intensive.
Optimization Strategies and Step-by-Step Fixes
1. Rewrite the Correlated Subquery as a Derived Table or CTE
Problem: Correlated subqueries execute row-by-row, causing repetitive index lookups.
Solution: Precompute MAX(InsertDate) per ProjID in a derived table and join it with the main query.
Original Query:
SELECT ...
FROM Project_List a
WHERE ...
AND a.InsertDate = (SELECT MAX(InsertDate) FROM Project_List WHERE ProjID = a.ProjID);
Optimized Query:
WITH MaxInsertDates AS (
SELECT ProjID, MAX(InsertDate) AS MaxInsertDate
FROM Project_List
GROUP BY ProjID
)
SELECT ...
FROM Project_List a
INNER JOIN MaxInsertDates m ON a.ProjID = m.ProjID AND a.InsertDate = m.MaxInsertDate
WHERE ...;
Why This Helps:
- Computes
MAX(InsertDate)once perProjID, eliminating per-row subquery execution. - Enables SQLite to use a more efficient join algorithm (e.g., hash join) if statistics are updated.
If Performance Worsens:
- Force materialization of the CTE using
MATERIALIZED(SQLite 3.34+):WITH MaxInsertDates AS MATERIALIZED (...) - Add a covering index for the subquery:
CREATE INDEX IF NOT EXISTS PL_ProjID_InsertDate_Covering ON Project_List (ProjID, InsertDate);
2. Optimize Indexes for Filter and Join Conditions
Step 1: Create a Composite Filtering Index
Add an index on columns used in equality conditions and range filters:
CREATE INDEX IF NOT EXISTS Project_List_Filter_Index
ON Project_List (Active, Target_Go_Live_Date, Status, Progress, BL_Start);
Why This Helps:
- Allows the
WHEREclause to use the index forActive = 'Yes',Target_Go_Live_Date < ..., and partial filtering onStatus/Progress. - Includes
BL_Startto validateBL_Start IS NOT NULLandBL_Start != ''without a table lookup.
Step 2: Add a Covering Index for Common Projections
Since the query selects 15+ columns, create a covering index for the most frequently accessed columns:
CREATE INDEX IF NOT EXISTS Project_List_Covering_Index
ON Project_List (
Active,
Target_Go_Live_Date,
ProjID,
InsertDate
)
INCLUDE (
CID, Project_Name, Start_date, Finish_Date, BL_Start, BL_Finish,
Status, Project_Type, Project_Delivered, Progress, Manager
);
Note: SQLite does not natively support INCLUDE, but you can append the columns to the index:
CREATE INDEX ... ON Project_List (
Active, Target_Go_Live_Date, ProjID, InsertDate,
CID, Project_Name, ..., Manager
);
Step 3: Remove Redundant Indexes
Drop indexes that overlap with the new composite index, such as Project_List_idx_92398598 (on Active, Target_Go_Live_Date).
3. Normalize Critical Columns to Reduce Row Size
Step 1: Extract Enumerated Values into Lookup Tables
For columns with repetitive values (e.g., Status, Progress, Project_Type), create reference tables:
CREATE TABLE Project_Status (
StatusID INTEGER PRIMARY KEY,
StatusName TEXT NOT NULL UNIQUE
);
INSERT INTO Project_Status (StatusName)
SELECT DISTINCT Status FROM Project_List;
ALTER TABLE Project_List ADD COLUMN StatusID INTEGER REFERENCES Project_Status(StatusID);
Step 2: Update Queries to Use Normalized Columns
Replace Status != 'Cancelled' with a join:
SELECT ...
FROM Project_List a
INNER JOIN Project_Status s ON a.StatusID = s.StatusID AND s.StatusName != 'Cancelled';
Why This Helps:
- Reduces storage footprint and index size.
- Enables efficient lookups on
StatusIDinstead of string comparisons.
4. Adjust Query Logic to Leverage Index-Friendly Conditions
Rewrite Non-Equality Filters:
Replace a.Progress != 'Completed' with a.Progress IN ('Pending', 'In Progress', ...) to allow index usage.
Combine Null/Empty Checks:
Replace a.BL_Start != '' AND a.BL_Start IS NOT NULL with a.BL_Start > '' (if BL_Start is a non-null text field).
5. Analyze and Update Table Statistics
SQLite relies on table statistics to choose optimal query plans. Force a reanalysis:
ANALYZE;
Manually update statistics for critical indexes:
UPDATE sqlite_stat1 SET stat = '10000 500' WHERE idx = 'Project_List_Filter_Index';
6. Consider Partial Indexes for Active Projects
Create an index that only includes active projects:
CREATE INDEX IF NOT EXISTS Project_List_Active_Filter
ON Project_List (Target_Go_Live_Date)
WHERE Active = 'Yes' AND PMO_Board_Report != 'No';
7. Benchmark and Iterate
Test each optimization incrementally:
- Measure baseline performance with the original query.
- Apply the CTE rewrite and compare VM steps/run time.
- Introduce composite indexes and verify query plan changes (
.eqp on). - Normalize high-impact columns and update queries.
Example benchmark results after optimization:
| Step | VM Steps | Run Time (s) |
|---|---|---|
| Original Query | 1,587,453 | 96.47 |
| CTE + Covering Index | 892,101 | 32.15 |
| Normalized Schema | 451,230 | 12.89 |
Final Thoughts
While a full schema redesign is ideal, the optimizations above can reduce execution time by 70–80% without structural changes. Prioritize composite indexes, CTE/materialized subqueries, and selective normalization of critical columns. Monitor performance as data grows and consider partitioning the table by InsertDate if historical data retention policies allow.