Optimizing DELETE Performance: Subqueries vs. Joins in SQLite

Understanding Query Execution Patterns in Subquery-Driven DELETE Operations

The core challenge revolves around optimizing a DELETE operation targeting the b_stk table using complex filtering logic involving percentile calculations. Two query variants are compared: one using multiple scalar subqueries in the WHERE clause, and another using JOINs with precomputed percentile values. The discussion highlights confusion about query plan interpretation, subquery execution frequency, and the impact of aggregation strategies on performance. Key technical elements include:

  1. Data Scale:

    • b_stk contains ~1,500 rows
    • Heavy use of custom kthpercentile() C extension for percentile calculations
    • JOIN operations with a_state table (size unspecified)
  2. Structural Differences:

    • Variant 1: Five separate scalar subqueries in WHERE clause for percentile thresholds
    • Variant 2: Five JOINed derived tables calculating percentiles upfront
    • Identical filtering logic for dbuy, f_mcap, p_pv20, sval2, ynow, and yspr
  3. Execution Plan Observations:

    • Variant 1 Query Plan:

      • SCAN b_stk operations for scalar subqueries (Rows 4,9,11,13,15)
      • MATERIALIZE for subquery containing yspr_max calculation
      • Sequential processing of subqueries
    • Variant 2 Query Plan:

      • MATERIALIZE operations for percentile subqueries (Rows 3,5,7,9,11)
      • Subsequent SCAN operations on materialized results
      • Parallel-like access to precomputed values
  4. Critical Unknowns:

    • Index presence on b_stk columns used in percentile calculations
    • Determinism guarantees of kthpercentile() function
    • Cardinality of a_state table and join conditions

Root Causes of Performance Variance in Aggregation Strategies

1. Repeated Full Table Scans from Scalar Subqueries

Each scalar subquery (SELECT kthpercentile(col, ...) FROM b_stk) in Variant 1 forces a full table scan of b_stk:

-- Each of these triggers separate b_stk scan:
AND sval2 < (SELECT kthpercentile(sval2, 0.7500) FROM b_stk)
AND p_pv20 > (SELECT kthpercentile(p_pv20, 0.2500) FROM b_stk)
...
  • No Result Caching: SQLite doesn’t cache subquery results unless explicitly materialized
  • O(n) Complexity: 5 subqueries → 5 full scans → 5×1,500 = 7,500 rows processed
  • Worse with Larger Tables: Linear scaling makes this approach unsustainable for growing data

2. Materialization Overhead in JOIN-Based Approach

Variant 2 uses materialization for percentile calculations:

JOIN (SELECT kthpercentile(dbuy, 0.2500) AS dbuy_min FROM b_stk)
JOIN (SELECT kthpercentile(f_mcap, 0.2500) AS f_mcap_min FROM b_stk)
...
  • Temporary Storage Costs: Each derived table creates a materialized temp object (Rows 3,5,7,9,11 in plan)
  • Optimizer Limitations: SQLite 3.x doesn’t automatically merge multiple single-aggregate subqueries
  • Memory vs. CPU Tradeoff: Reduced table scans at the cost of increased memory allocation

3. Index Utilization Deficiencies

The absence (or presence) of indexes dramatically impacts both approaches:

  • Without Indexes:

    • kthpercentile() requires full column scans regardless of approach
    • JOIN variant wins by scanning once vs. multiple times
  • With Indexes:

    • Descending index on sval2 could make kthpercentile(sval2, 0.75) an O(1) index seek
    • Separate scalar subqueries might outperform materialized JOINs
  • Current Evidence:

    • Query plans show SCAN b_stk (full scans) not SEARCH (indexed access)
    • Implies missing indexes on dbuy, f_mcap, p_pv20, sval2, yspr

4. Function Determinism and Query Optimization

The custom kthpercentile() C extension’s determinism status affects plan caching:

  • Deterministic Declaration:

    sqlite3_create_function(db, "kthpercentile", 2, SQLITE_UTF8, 0, 
                            &kthpercentile_impl, 0, 0);
    

    Missing SQLITE_DETERMINISTIC flag prevents reuse of computed values across subqueries

  • Non-Deterministic Impact:

    • Optimizer assumes different results per invocation → prevents subquery merging
    • Forces re-execution even for identical parameters

5. Join Order and Filter Pushdown

The original queries show suboptimal join sequencing:

FROM a_state
JOIN b_stk
JOIN (SELECT ...)
  • Implicit CROSS JOIN: Missing explicit join conditions create Cartesian products
  • Filter Placement: WHERE clause filters applied late vs. early predicate pushdown
  • Join Collapsing Opportunities: SQLite’s ability to flatten subqueries limited by aggregate functions

Comprehensive Optimization Strategy for Percentile-Based Deletions

1. Consolidate Percentile Calculations

Solution: Single CTE for all percentiles

WITH Percentiles AS (
  SELECT
    kthpercentile(dbuy, 0.2500) AS dbuy_min,
    kthpercentile(f_mcap, 0.2500) AS f_mcap_min,
    kthpercentile(p_pv20, 0.2500) AS p_pv20_min,
    kthpercentile(sval2, 0.7500) AS sval2_max,
    kthpercentile(yspr, 0.5000) AS yspr_max
  FROM b_stk
)
DELETE FROM b_stk
WHERE stkid IN (
  SELECT b.stkid
  FROM b_stk b
  JOIN a_state a ON (...)
  CROSS JOIN Percentiles p
  WHERE NOT (...)
);

Benefits:

  • Single b_stk scan for all aggregates
  • Eliminates 4 redundant table scans
  • Clear separation of data calculation and filtering logic

Execution Plan Improvements:

  • MATERIALIZE 1 (CTE) vs. 5 materializations
  • Single SCAN b_stk in CTE vs. multiple in original

2. Indexing Strategy for Percentile Optimization

Recommended Indexes:

CREATE INDEX idx_b_stk_dbuy ON b_stk(dbuy);
CREATE INDEX idx_b_stk_f_mcap ON b_stk(f_mcap);
CREATE INDEX idx_b_stk_p_pv20 ON b_stk(p_pv20);
CREATE INDEX idx_b_stk_sval2 ON b_stk(sval2);
CREATE INDEX idx_b_stk_yspr ON b_stk(yspr);

Index-Aware kthpercentile() Implementation:
Modify the C extension to leverage indexes when available:

  1. Detect if column has ordered index
  2. If indexed, perform index scan to find k-th percentile directly
  3. Fallback to full scan when no index exists

Example Optimization:
For kthpercentile(sval2, 0.75) with descending index:

  • Navigate to index end (max value)
  • Traverse 25% backward (for 0.75 percentile)
  • O(1) time vs. O(n) full scan

3. Deterministic Function Registration

Modify function registration to enable optimizer caching:

sqlite3_create_function(db, "kthpercentile", 2, 
                        SQLITE_UTF8 | SQLITE_DETERMINISTIC, 
                        0, &kthpercentile_impl, 0, 0);

Effects:

  • Allows SQLite to reuse identical kthpercentile() calls
  • Enables subquery merging and common expression elimination

4. Join Condition Analysis and Optimization

Explicit Join Conditions:
Original implicit joins risk Cartesian products. Refactor to:

FROM a_state a
JOIN b_stk b ON a.stkid = b.stkid
CROSS JOIN (
  SELECT ... 
  FROM b_stk
) p

Predicate Pushdown:
Move filters closer to data sources:

SELECT stkid 
FROM b_stk b
JOIN a_state a ON b.stkid = a.stkid 
  AND ynow < 275.00  -- Push to join condition
  AND (0 < f_pe AND f_pe < 64) 
...

Benefits:

  • Reduces rows joined early in execution plan
  • Minimizes data movement between query stages

5. Temporary Table Materialization Control

Control materialization with pragmas:

PRAGMA temp_store = MEMORY;  -- Faster than disk-based temps
PRAGMA cache_size = -10000;  -- 10MB cache

Materialization Threshold Tuning:
For SQLite versions ≥3.24, use MATERIALIZED/NOT MATERIALIZED hints:

WITH Percentiles AS MATERIALIZED (...)

Force materialization when beneficial, prevent when not needed.

6. Query Plan Analysis Methodology

Step-by-Step Plan Interpretation:

  1. Identify SCAN vs SEARCH operations

    • SCAN b_stk = Full table scan (bad)
    • SEARCH ... USING INDEX = Index utilization (good)
  2. Count materializations (MATERIALIZE in plan)

    • Each materialization = temp table creation
  3. Analyze subquery depth:

    • LIST SUBQUERY X indicates depth of nesting
    • Deeper nesting often correlates with higher memory use

Original Query Plan Remediation:

  • Variant 1:
    SCAN b_stk in scalar subqueries → Consolidate to 1 scan via CTE

  • Variant 2:
    MATERIALIZE for percentile subqueries → Replace with single CTE

7. Statistical Optimization with ANALYZE

Generate statistics for query planner:

ANALYZE;
CREATE STATISTICS s_b_stk_dbuy ON b_stk(dbuy);
...

Enable Cost-Based Optimization:

  • Helps planner choose between index vs. scan
  • Improves join order decisions

8. Partial Indexes for Common Filter Patterns

For frequently used filters:

CREATE INDEX idx_b_stk_active ON b_stk(stkid)
WHERE ynow < 275.00 
  AND (0 < f_pe AND f_pe < 64);

Benefits:

  • Smaller index size
  • Automatic filter application during scans

9. Batch Deletion for Large Operations

For very large b_stk tables, chunk deletions:

DELETE FROM b_stk 
WHERE stkid IN (...)
LIMIT 1000;

Repeat Until Affected Rows = 0
Advantages:

  • Reduces transaction lock contention
  • Allows incremental progress monitoring

10. Profiling with Application-Defined Functions

Instrument kthpercentile() for performance telemetry:

static void kthpercentile_impl(
  sqlite3_context *context,
  int argc,
  sqlite3_value **argv
){
  clock_t start = clock();
  // ... existing logic ...
  sqlite3_result_double(context, result);
  log_time(clock() - start);  // Custom timing logger
}

Metrics to Capture:

  • Invocation count per query
  • Execution time distribution
  • Index vs. scan utilization rates

11. Alternative Storage Models

If b_stk is static during deletions, consider:

  1. Write-Ahead Logging (WAL) Mode:

    PRAGMA journal_mode = WAL;
    
    • Allows concurrent reads during DELETE
  2. Temporary Shadow Table:

    CREATE TEMP TABLE to_delete AS 
    SELECT stkid FROM ...;  -- Complex logic here
    
    DELETE FROM b_stk 
    WHERE stkid IN (SELECT stkid FROM to_delete);
    
    • Separates identification from deletion
    • Enables index optimization on temp table

12. Parameterized Percentile Thresholds

For adjustable percentiles without query rewrite:

CREATE TABLE config (
  threshold_name TEXT PRIMARY KEY,
  percentile REAL
);

INSERT INTO config VALUES 
  ('dbuy', 0.25),
  ('f_mcap', 0.25),
  ...;
  
WITH Percentiles AS (
  SELECT
    kthpercentile(dbuy, (SELECT percentile FROM config WHERE threshold_name='dbuy')) AS dbuy_min,
    ...
)
...

Benefits:

  • Centralized threshold management
  • No SQL text modification needed for adjustments

13. Query Plan Forcing with SQLITE_ENABLE_STAT4

Compile SQLite with advanced statistics:

./configure --enable-stat4

Enables:

  • Histogram-based cost estimation
  • Better index selectivity calculations
  • Improved join order decisions

14. Function Inlining for Single-Use Aggregates

If kthpercentile() is only used here, consider SQL-based percentile calculation:

-- For small tables, SQL implementation may suffice
SELECT dbuy FROM b_stk ORDER BY dbuy LIMIT 1 OFFSET (
  SELECT CAST(COUNT(*)*0.25 AS INTEGER) FROM b_stk
)

Tradeoffs:

  • Avoids C extension complexity
  • May be slower for large datasets

15. Concurrency Considerations During Deletion

Locking Behavior:

  • DELETE operations acquire RESERVED locks
  • Long-running deletions block writers

Mitigation Strategies:

  1. Split Transaction:

    BEGIN;
    DELETE ... LIMIT 1000;
    COMMIT;
    

    Repeat until completion

  2. Vacuum Automation:
    Post-deletion VACUUM to reclaim space

  3. Trigger-Based Batch Processing:
    Use AFTER DELETE triggers to process in chunks

Final Recommendations

  1. Immediate Action:

    • Consolidate percentile subqueries into single CTE
    • Add indexes on filtered/aggregated columns
    • Declare kthpercentile() as deterministic
  2. Medium-Term Improvements:

    • Implement index-aware percentile calculations
    • Generate statistics with ANALYZE
    • Convert to parameterized threshold configuration
  3. Long-Term Strategy:

    • Migrate to in-memory temp tables for percentile staging
    • Implement batch deletion with progress tracking
    • Add comprehensive query plan monitoring

By systematically addressing materialization overhead, index utilization, and function determinism, the DELETE operation can achieve O(1) table scans with subquery execution time reduced by 80% (from 5 scans to 1). Continuous monitoring through query plan analysis and function profiling ensures sustained performance as data scales.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *