Optimizing Slow UPDATE Query with GLOB and SUBSTR in SQLite


Performance Bottlenecks in Correlated Subquery with GLOB and SUBSTR

Issue Overview

The core challenge involves optimizing an UPDATE query that uses a correlated subquery with GLOB and SUBSTR to populate a column (count_yp) in the FA_MOP table. The subquery counts distinct Repere values from the Conduite table that match dynamically generated patterns derived from FA_MOP.Fiches_HTML. The query executes slowly (30 seconds), likely due to inefficient pattern matching, absence of indexing strategies for GLOB, and repeated computations of string operations. Key technical components include:

  1. Correlated Subquery Mechanics: For each row in FA_MOP, the subquery recalculates SUBSTR(f.Fiches_HTML, 1, LENGTH(f.Fiches_HTML)-1) || '[1-9]' and scans Conduite.Repere for matches using GLOB. This results in O(N*M) complexity, where N is the number of rows in FA_MOP and M is the number of rows in Conduite.

  2. Pattern Matching with GLOB: The GLOB operator is inherently slow for patterns with leading wildcards (*), as it cannot leverage indexes. The subquery’s GLOB pattern SUBSTR(...) || '[1-9]' and the WHERE clause’s Fiches_HTML GLOB '*YP[1-9]' force full-table scans.

  3. String Manipulation Overhead: The SUBSTR function dynamically truncates the last character of Fiches_HTML for each row in FA_MOP, creating computational overhead and preventing precomputation or indexing of the base pattern.

  4. Data Distribution: Without knowledge of table sizes or existing indexes, assumptions must be made. If FA_MOP has 10,000 rows and Conduite has 1,000,000 rows, the subquery executes 10,000 times, each scanning 1M rows—resulting in 10 billion operations.

Root Causes of Poor Performance

  1. Non-SARGable GLOB Patterns:

    • The GLOB operator in c.Repere GLOB SUBSTR(...) || '[1-9]' uses a dynamic prefix derived from Fiches_HTML, which prevents SQLite from utilizing indexes on Conduite.Repere.
    • The WHERE clause Fiches_HTML GLOB '*YP[1-9]' contains a leading wildcard, rendering any index on Fiches_HTML ineffective.
  2. Correlated Subquery Execution:

    • The subquery is re-executed for every row in FA_MOP that matches the WHERE clause. This amplifies the cost of pattern matching and string operations.
  3. Lack of Precomputed Values:

    • The truncated Fiches_HTML value (SUBSTR(f.Fiches_HTML, 1, LENGTH(f.Fiches_HTML)-1)) is recalculated repeatedly. SQLite cannot cache or index this value without a generated column.
  4. Missing Indexes:

    • Absence of indexes on Conduite.Repere or FA_MOP.Fiches_HTML forces full-table scans during both the WHERE filter and subquery execution.
  5. DISTINCT in COUNT:

    • COUNT(DISTINCT c.Repere) adds sorting/hashing overhead to eliminate duplicates, which may be unnecessary if Repere is already unique or indexed.

Optimizations and Solutions

Step 1: Schema Modifications for Precomputed Values

1.1 Add Generated Columns
Create a stored generated column to precompute the truncated Fiches_HTML value, eliminating repeated SUBSTR calls:

ALTER TABLE FA_MOP ADD COLUMN Fiches_Base TEXT 
GENERATED ALWAYS AS (SUBSTR(Fiches_HTML, 1, LENGTH(Fiches_HTML)-1)) STORED;

1.2 Index the Generated Column
Create an index on Fiches_Base to accelerate lookups in the WHERE clause and subquery:

CREATE INDEX idx_fa_mop_fiches_base ON FA_MOP(Fiches_Base);

1.3 Index Conduite.Repere
If Repere is frequently queried, add an index to speed up GLOB/LIKE matches:

CREATE INDEX idx_conduite_repere ON Conduite(Repere);
Step 2: Rewrite the Query to Minimize Correlated Execution

2.1 Precompute Patterns and Use a Temporary Table
Extract all unique Fiches_Base values and store them in a temporary table. Join this with Conduite to precompute counts:

CREATE TEMP TABLE FichesPatterns AS
SELECT Fiches_Base, Fiches_Base || '[1-9]' AS Pattern 
FROM FA_MOP 
WHERE Fiches_HTML GLOB '*YP[1-9]';

CREATE TEMP TABLE PatternCounts AS
SELECT fp.Fiches_Base, COUNT(DISTINCT c.Repere) AS count_yp
FROM FichesPatterns fp
LEFT JOIN Conduite c ON c.Repere GLOB fp.Pattern
GROUP BY fp.Fiches_Base;

UPDATE FA_MOP
SET count_yp = (
  SELECT pc.count_yp 
  FROM PatternCounts pc 
  WHERE pc.Fiches_Base = FA_MOP.Fiches_Base
)
WHERE EXISTS (SELECT 1 FROM PatternCounts pc WHERE pc.Fiches_Base = FA_MOP.Fiches_Base);

2.2 Replace GLOB with LIKE Where Possible
If the pattern [1-9] can be expressed as a single character wildcard (_), use LIKE instead:

-- Replace SUBSTR(...) || '[1-9]' with SUBSTR(...) || '_'
WHERE c.Repere LIKE fp.Fiches_Base || '_'

LIKE is often faster than GLOB and can utilize indexes if the pattern is anchored (no leading wildcards).

Step 3: Leverage Indexes with Refined Patterns

3.1 Optimize the WHERE Clause
The original WHERE clause Fiches_HTML GLOB '*YP[1-9]' can be rewritten to use LIKE with a trailing wildcard if the YP[1-9] is a fixed suffix:

WHERE Fiches_HTML LIKE '%YP_'

However, leading wildcards still prevent index usage. If YP[1-9] always occurs at the end of Fiches_HTML, consider adding a reverse index:

ALTER TABLE FA_MOP ADD COLUMN Fiches_Reverse TEXT 
GENERATED ALWAYS AS (REVERSE(Fiches_HTML)) STORED;

CREATE INDEX idx_fa_mop_fiches_reverse ON FA_MOP(Fiches_Reverse);

-- Query using reversed pattern
WHERE Fiches_Reverse GLOB '[1-9]PY*'

3.2 Use Full-Text Search for Fixed Suffixes
If YP[1-9] is a fixed suffix, store it in a separate column and index it:

ALTER TABLE FA_MOP ADD COLUMN Suffix TEXT 
GENERATED ALWAYS AS (SUBSTR(Fiches_HTML, -3)) STORED;

CREATE INDEX idx_fa_mop_suffix ON FA_MOP(Suffix);

-- Query using exact suffix matches
WHERE Suffix GLOB 'YP[1-9]'
Step 4: Analyze and Tune Performance

4.1 Execute ANALYZE
Generate statistics for the query planner to make informed decisions:

ANALYZE;

4.2 Review Query Plans
Use EXPLAIN QUERY PLAN to verify index usage:

EXPLAIN QUERY PLAN
UPDATE FA_MOP ... [rest of the query]

Look for SCAN TABLE Conduite or SCAN TABLE FA_MOP in the output, which indicate full-table scans. Ideally, you should see SEARCH TABLE ... USING INDEX.

4.3 Benchmark Incremental Changes
Test each optimization step independently:

  1. Add generated columns and indexes.
  2. Rewrite the query using temporary tables.
  3. Replace GLOB with LIKE.

Measure execution time after each change to identify the most impactful optimizations.

Step 5: Advanced Optimizations

5.1 Materialized Views
If the data is static, precompute counts and store them in a separate table refreshed periodically:

CREATE TABLE YP_Counts (
  Fiches_Base TEXT PRIMARY KEY,
  count_yp INTEGER
);

INSERT INTO YP_Counts (Fiches_Base, count_yp)
SELECT Fiches_Base, COUNT(DISTINCT Repere) 
FROM Conduite 
WHERE Repere GLOB Fiches_Base || '[1-9]'
GROUP BY Fiches_Base;

-- Update FA_MOP using the materialized view
UPDATE FA_MOP 
SET count_yp = (SELECT count_yp FROM YP_Counts WHERE Fiches_Base = FA_MOP.Fiches_Base);

5.2 Use Application-Side Caching
Offload pattern matching and counting to the application layer, where string operations may be faster, and cache results.

5.3 Normalize the Data Model
If Repere values follow a strict pattern (e.g., BaseYPX), split them into columns:

ALTER TABLE Conduite ADD COLUMN Base TEXT;
ALTER TABLE Conduite ADD COLUMN YP_Number INTEGER;

UPDATE Conduite 
SET 
  Base = SUBSTR(Repere, 1, LENGTH(Repere)-1),
  YP_Number = CAST(SUBSTR(Repere, -1) AS INTEGER)
WHERE Repere GLOB '*[1-9]';

CREATE INDEX idx_conduite_base ON Conduite(Base);

Then rewrite the query to join on Base and filter YP_Number BETWEEN 1 AND 9.


Final Recommendations

  1. Prioritize Generated Columns and Indexes: Precompute Fiches_Base and index it to eliminate redundant SUBSTR calls.
  2. Avoid Correlated Subqueries: Use temporary tables or materialized views to batch-process counts.
  3. Replace GLOB with LIKE: When possible, use simpler patterns that can leverage indexes.
  4. Profile and Iterate: Use EXPLAIN QUERY PLAN and ANALYZE to guide optimizations.

By restructuring the schema, precomputing values, and minimizing correlated subqueries, the query’s execution time can be reduced from 30 seconds to sub-second performance.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *