Optimizing Slow SQLite Query with Correlated Subquery and Multiple WHERE Conditions
Query Performance Bottlenecks in Project_List Table with Complex Filters and Subquery
Issue Overview
The core performance issue stems from a SELECT query on a large, denormalized Project_List table with multiple filtering conditions and a correlated scalar subquery. The query aims to retrieve the most recent record (via MAX(InsertDate)) for each ProjID while enforcing several business rules:
- Exclude projects marked as "No" in PMO_Board_Report.
- Filter for projects with Target_Go_Live_Date within the next month.
- Ensure projects are active (Active = 'Yes'), not cancelled, and not in specific progress states.
- Exclude records with empty or null BL_Start values.
The query plan reveals two critical bottlenecks:
- Index Efficiency: The primary index used (Project_List_idx_92398598 on Active, Target_Go_Live_Date) reduces the initial row scan but doesn't address the other filtering conditions (e.g., Progress, Status, BL_Start).
- Correlated Subquery Overhead: The subquery (SELECT MAX(InsertDate) FROM Project_List WHERE ProjID = a.ProjID) executes once per row returned by the outer query, resulting in 1.5 million virtual machine (VM) steps.
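To reproduce the diagnosis, prefix the query with EXPLAIN QUERY PLAN in the sqlite3 shell. The snippet below is a simplified stand-in for the full query (only two of the filter conditions are shown); a CORRELATED SCALAR SUBQUERY line in the output confirms the per-row execution:
EXPLAIN QUERY PLAN
SELECT *
FROM Project_List a
WHERE a.Active = 'Yes'
  AND a.InsertDate = (SELECT MAX(InsertDate)
                      FROM Project_List
                      WHERE ProjID = a.ProjID);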
Additional factors include:
- Overly Broad Primary Key: The composite primary key includes 47 columns, inflating index sizes and slowing lookups.
- Non-SARGable Conditions: Filters like a.Progress != 'Completed' and a.BL_Start != '' cannot leverage indexes effectively.
- Data Volume: Frequent data imports have increased table size, exacerbating inefficiencies in the schema and query design.
Root Causes of Slow Performance
1. Inefficient Use of Indexes for Multi-Column Filters
The query's WHERE clause includes 9 conditions, but only Active and Target_Go_Live_Date benefit from the index Project_List_idx_92398598. Other conditions (e.g., Status != 'Cancelled', Progress exclusions) must be evaluated row by row against everything the initial index scan returns. This is exacerbated by:
- Non-Equality Conditions: !=, <>, and IS NOT NULL operators prevent effective index usage.
- Wide Table Structure: Retrieving all columns (SELECT *-like behavior) forces costly table lookups after index scans.
2. Correlated Subquery Execution Model
The subquery (SELECT MAX(InsertDate) ...) is correlated, meaning it runs once for every row processed by the outer query. While the PL_ProjID_InsertDate_New index (on ProjID, InsertDate) makes each individual execution fast, the sheer number of executions (one per candidate row the outer query examines) adds significant overhead.
3. Schema Design Limitations
- Denormalized Data: The table lacks normalization (e.g., no separate tables for Status, Progress, or Project_Type), leading to repetitive values and larger row sizes.
- Index Proliferation: Many single-column or redundant indexes (e.g., PL_ProjID_Project_Name, PL_ProjID_Manager) consume storage but provide minimal query benefit.
- Composite Primary Key Bloat: The 47-column primary key forces SQLite to maintain an enormous unique index over all 47 columns, and in a WITHOUT ROWID table those columns would be duplicated into every secondary index, increasing their size and reducing efficiency.
4. Data Distribution and Filter Selectivity
- Low Selectivity on Active: If most rows have Active = 'Yes', the index on Active, Target_Go_Live_Date filters out few rows.
- Frequent InsertDate Updates: If ProjID groups accumulate many InsertDate revisions, the subquery's MAX(InsertDate) calculation becomes more expensive.
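Both distributions can be measured directly before committing to an index strategy. A quick sketch, using the table and column names from this article:
-- How skewed is Active? If nearly every row is 'Yes', the index's
-- leading column does little filtering.
SELECT Active, COUNT(*) AS row_count
FROM Project_List
GROUP BY Active;

-- How many revisions does a typical ProjID accumulate?
SELECT AVG(cnt) AS avg_revisions, MAX(cnt) AS max_revisions
FROM (SELECT ProjID, COUNT(*) AS cnt FROM Project_List GROUP BY ProjID);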
Optimization Strategies and Step-by-Step Fixes
1. Rewrite the Correlated Subquery as a Derived Table or CTE
Problem: Correlated subqueries execute row-by-row, causing repetitive index lookups.
Solution: Precompute MAX(InsertDate) per ProjID in a derived table or CTE and join it back to the main query.
Original Query:
SELECT ...
FROM Project_List a
WHERE ...
AND a.InsertDate = (SELECT MAX(InsertDate) FROM Project_List WHERE ProjID = a.ProjID);
Optimized Query:
WITH MaxInsertDates AS (
SELECT ProjID, MAX(InsertDate) AS MaxInsertDate
FROM Project_List
GROUP BY ProjID
)
SELECT ...
FROM Project_List a
INNER JOIN MaxInsertDates m ON a.ProjID = m.ProjID AND a.InsertDate = m.MaxInsertDate
WHERE ...;
Why This Helps:
- Computes MAX(InsertDate) once per ProjID, eliminating per-row subquery execution.
- Lets SQLite satisfy the aggregation with a single scan of the PL_ProjID_InsertDate_New index and then join back to the base table through ProjID, rather than re-entering the subquery for every candidate row. (SQLite implements joins as nested loops, so reducing the number of lookups is what matters.)
If Performance Worsens:
- Force materialization of the CTE using the MATERIALIZED keyword (SQLite 3.35+): WITH MaxInsertDates AS MATERIALIZED (...)
- Ensure a covering index exists for the aggregation. The existing PL_ProjID_InsertDate_New (on ProjID, InsertDate) already covers it; if it has been dropped, recreate it:
CREATE INDEX IF NOT EXISTS PL_ProjID_InsertDate_Covering ON Project_List (ProjID, InsertDate);
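If the CTE route still disappoints, two further rewrites are worth benchmarking. Both are sketches against the same Project_List schema; the second relies on documented SQLite-specific behavior and is not portable SQL:
-- Alternative 1: window function (SQLite 3.25+). Rank each ProjID's
-- rows by InsertDate and keep only the newest revision.
SELECT *
FROM (
    SELECT p.*,
           ROW_NUMBER() OVER (PARTITION BY ProjID
                              ORDER BY InsertDate DESC) AS rn
    FROM Project_List p
)
WHERE rn = 1
  AND Active = 'Yes';  -- business filters go here so they cannot
                       -- change which revision counts as newest

-- Alternative 2: SQLite's "bare columns in an aggregate" rule. With
-- MAX() and GROUP BY, bare columns are taken from the row that holds
-- the maximum, so this returns each project's latest revision.
SELECT ProjID, MAX(InsertDate) AS InsertDate, Project_Name, Status
FROM Project_List
GROUP BY ProjID;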
2. Optimize Indexes for Filter and Join Conditions
Step 1: Create a Composite Filtering Index
Add an index on columns used in equality conditions and range filters:
CREATE INDEX IF NOT EXISTS Project_List_Filter_Index
ON Project_List (Active, Target_Go_Live_Date, Status, Progress, BL_Start);
Why This Helps:
- Allows the WHERE clause to use the index for Active = 'Yes', Target_Go_Live_Date < ..., and partial filtering on Status/Progress.
- Includes BL_Start so that BL_Start IS NOT NULL and BL_Start != '' can be checked without a table lookup.
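After creating the index, confirm the planner actually chooses it. A sketch, with the WHERE clause abbreviated to the two leading conditions:
EXPLAIN QUERY PLAN
SELECT ProjID, InsertDate
FROM Project_List
WHERE Active = 'Yes'
  AND Target_Go_Live_Date < date('now', '+1 month');
-- Expect output along the lines of:
-- SEARCH Project_List USING INDEX Project_List_Filter_Index
--   (Active=? AND Target_Go_Live_Date<?)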
Step 2: Add a Covering Index for Common Projections
Since the query selects 15+ columns, a covering index over the most frequently accessed columns lets SQLite answer the query from the index alone. Note that SQLite does not support the INCLUDE clause found in SQL Server and PostgreSQL, so the payload columns must be appended to the index key itself:
CREATE INDEX IF NOT EXISTS Project_List_Covering_Index
ON Project_List (
    Active, Target_Go_Live_Date, ProjID, InsertDate,
    CID, Project_Name, Start_date, Finish_Date, BL_Start, BL_Finish,
    Status, Project_Type, Project_Delivered, Progress, Manager
);
Keep the trade-off in mind: an index this wide noticeably increases database size and slows every insert and update.
Step 3: Remove Redundant Indexes
Drop indexes that overlap with the new composite index, such as Project_List_idx_92398598 (on Active, Target_Go_Live_Date).
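A sketch of the cleanup; list the table's indexes first so overlapping prefixes are easy to spot:
-- Enumerate every index currently defined on the table.
SELECT name, sql
FROM sqlite_master
WHERE type = 'index' AND tbl_name = 'Project_List';

-- Project_List_Filter_Index starts with (Active, Target_Go_Live_Date),
-- making the old two-column index redundant.
DROP INDEX IF EXISTS Project_List_idx_92398598;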
3. Normalize Critical Columns to Reduce Row Size
Step 1: Extract Enumerated Values into Lookup Tables
For columns with repetitive values (e.g., Status, Progress, Project_Type), create reference tables:
CREATE TABLE Project_Status (
StatusID INTEGER PRIMARY KEY,
StatusName TEXT NOT NULL UNIQUE
);
INSERT INTO Project_Status (StatusName)
SELECT DISTINCT Status FROM Project_List;
ALTER TABLE Project_List ADD COLUMN StatusID INTEGER REFERENCES Project_Status(StatusID);
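The ALTER TABLE above adds StatusID but leaves it NULL everywhere, so a backfill pass is needed before any query switches over. A sketch (the index name is illustrative):
-- Populate StatusID from the existing text column.
UPDATE Project_List
SET StatusID = (SELECT StatusID
                FROM Project_Status
                WHERE StatusName = Project_List.Status);

-- Index the new foreign key so joins and filters on it stay cheap.
CREATE INDEX IF NOT EXISTS PL_StatusID ON Project_List (StatusID);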
Step 2: Update Queries to Use Normalized Columns
Replace Status != 'Cancelled' with a join:
SELECT ...
FROM Project_List a
INNER JOIN Project_Status s ON a.StatusID = s.StatusID AND s.StatusName != 'Cancelled';
Why This Helps:
- Reduces storage footprint and index size.
- Enables efficient integer lookups on StatusID instead of string comparisons.
4. Adjust Query Logic to Leverage Index-Friendly Conditions
Rewrite Non-Equality Filters:
Replace a.Progress != 'Completed' with a.Progress IN ('Pending', 'In Progress', ...) so the index can seek on the listed values. Note that the IN list must enumerate every state that should pass the filter, so it needs updating whenever a new Progress value is introduced.
Combine Null/Empty Checks:
Replace a.BL_Start != '' AND a.BL_Start IS NOT NULL with a.BL_Start > ''. Comparisons against NULL are never true in SQL, so this single range condition excludes both NULL and empty values, and unlike !=, a > bound is a condition an index can seek on.
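Putting both rewrites together, the filter block becomes fully index-friendly. A sketch; the Progress values are placeholders for whatever states actually exist in the data:
SELECT ...
FROM Project_List a
WHERE a.Active = 'Yes'
  AND a.Progress IN ('Pending', 'In Progress')   -- hypothetical state list
  AND a.BL_Start > ''                            -- excludes NULL and ''
  AND a.Target_Go_Live_Date < date('now', '+1 month');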
5. Analyze and Update Table Statistics
SQLite relies on table statistics to choose optimal query plans. Force a reanalysis:
ANALYZE;
Manually overriding statistics for critical indexes is an advanced, last-resort step; hand-edited values only take effect once the statistics are reloaded, and any later ANALYZE overwrites them:
UPDATE sqlite_stat1 SET stat = '10000 500' WHERE idx = 'Project_List_Filter_Index';
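To see what the planner currently believes, and to make hand-edited values take effect in an open connection, SQLite's documentation suggests reloading via ANALYZE sqlite_master:
-- Inspect the collected statistics for this table's indexes.
SELECT tbl, idx, stat FROM sqlite_stat1 WHERE tbl = 'Project_List';

-- Reload statistics after editing sqlite_stat1 by hand.
ANALYZE sqlite_master;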
6. Consider Partial Indexes for Active Projects
Create an index that only includes active projects:
CREATE INDEX IF NOT EXISTS Project_List_Active_Filter
ON Project_List (Target_Go_Live_Date)
WHERE Active = 'Yes' AND PMO_Board_Report != 'No';
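SQLite only uses a partial index when the query's WHERE clause provably implies the index's WHERE clause, so both conditions must appear verbatim in the query. A sketch:
SELECT ProjID, Project_Name, Target_Go_Live_Date
FROM Project_List
WHERE Active = 'Yes'
  AND PMO_Board_Report != 'No'   -- must match the index predicate
  AND Target_Go_Live_Date < date('now', '+1 month');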
7. Benchmark and Iterate
Test each optimization incrementally:
- Measure baseline performance with the original query.
- Apply the CTE rewrite and compare VM steps/run time.
- Introduce composite indexes and verify query plan changes (.eqp on).
- Normalize high-impact columns and update queries.
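The sqlite3 shell's built-in dot commands cover the measurement loop; a sketch, where query.sql is a placeholder for the query under test:
.timer on        -- print elapsed time after each statement
.eqp on          -- show the query plan before each statement runs
.stats on        -- report per-statement statistics, including VM steps
.read query.sql  -- run the query under test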
Example benchmark results after optimization:
| Step | VM Steps | Run Time (s) |
| --- | --- | --- |
| Original Query | 1,587,453 | 96.47 |
| CTE + Covering Index | 892,101 | 32.15 |
| Normalized Schema | 451,230 | 12.89 |
Final Thoughts
While a full schema redesign is ideal, the optimizations above can reduce execution time by 70–80% without structural changes. Prioritize composite indexes, CTE/materialized subqueries, and selective normalization of critical columns. Monitor performance as data grows, and if historical data retention policies allow, consider moving old rows out of the hot table by InsertDate (SQLite has no native table partitioning).
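Since SQLite has no native partitioning, "partitioning by InsertDate" in practice means a manual archive table. A sketch, with the cutoff date as a placeholder:
-- One-time setup: an empty archive table with the same columns
-- (CREATE TABLE ... AS SELECT copies the shape, not the constraints).
CREATE TABLE IF NOT EXISTS Project_List_Archive AS
SELECT * FROM Project_List WHERE 0;

-- Periodically move rows older than the retention cutoff.
BEGIN;
INSERT INTO Project_List_Archive
SELECT * FROM Project_List WHERE InsertDate < '2023-01-01';
DELETE FROM Project_List WHERE InsertDate < '2023-01-01';
COMMIT;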