Optimizing Slow UPDATE Query with GLOB and SUBSTR in SQLite
Performance Bottlenecks in Correlated Subquery with GLOB and SUBSTR
Issue Overview
The core challenge involves optimizing an UPDATE
query that uses a correlated subquery with GLOB
and SUBSTR
to populate a column (count_yp
) in the FA_MOP
table. The subquery counts distinct Repere
values from the Conduite
table that match dynamically generated patterns derived from FA_MOP.Fiches_HTML
. The query executes slowly (30 seconds), likely due to inefficient pattern matching, absence of indexing strategies for GLOB
, and repeated computations of string operations. Key technical components include:
Correlated Subquery Mechanics: For each row in
FA_MOP
, the subquery recalculatesSUBSTR(f.Fiches_HTML, 1, LENGTH(f.Fiches_HTML)-1) || '[1-9]'
and scansConduite.Repere
for matches usingGLOB
. This results in O(N*M) complexity, where N is the number of rows inFA_MOP
and M is the number of rows inConduite
.Pattern Matching with
GLOB
: TheGLOB
operator is inherently slow for patterns with leading wildcards (*
), as it cannot leverage indexes. The subquery’sGLOB
patternSUBSTR(...) || '[1-9]'
and theWHERE
clause’sFiches_HTML GLOB '*YP[1-9]'
force full-table scans.String Manipulation Overhead: The
SUBSTR
function dynamically truncates the last character ofFiches_HTML
for each row inFA_MOP
, creating computational overhead and preventing precomputation or indexing of the base pattern.Data Distribution: Without knowledge of table sizes or existing indexes, assumptions must be made. If
FA_MOP
has 10,000 rows andConduite
has 1,000,000 rows, the subquery executes 10,000 times, each scanning 1M rows—resulting in 10 billion operations.
Root Causes of Poor Performance
Non-SARGable
GLOB
Patterns:- The
GLOB
operator inc.Repere GLOB SUBSTR(...) || '[1-9]'
uses a dynamic prefix derived fromFiches_HTML
, which prevents SQLite from utilizing indexes onConduite.Repere
. - The
WHERE
clauseFiches_HTML GLOB '*YP[1-9]'
contains a leading wildcard, rendering any index onFiches_HTML
ineffective.
- The
Correlated Subquery Execution:
- The subquery is re-executed for every row in
FA_MOP
that matches theWHERE
clause. This amplifies the cost of pattern matching and string operations.
- The subquery is re-executed for every row in
Lack of Precomputed Values:
- The truncated
Fiches_HTML
value (SUBSTR(f.Fiches_HTML, 1, LENGTH(f.Fiches_HTML)-1)
) is recalculated repeatedly. SQLite cannot cache or index this value without a generated column.
- The truncated
Missing Indexes:
- Absence of indexes on
Conduite.Repere
orFA_MOP.Fiches_HTML
forces full-table scans during both theWHERE
filter and subquery execution.
- Absence of indexes on
DISTINCT in COUNT:
COUNT(DISTINCT c.Repere)
adds sorting/hashing overhead to eliminate duplicates, which may be unnecessary ifRepere
is already unique or indexed.
Optimizations and Solutions
Step 1: Schema Modifications for Precomputed Values
1.1 Add Generated Columns
Create a stored generated column to precompute the truncated Fiches_HTML
value, eliminating repeated SUBSTR
calls:
ALTER TABLE FA_MOP ADD COLUMN Fiches_Base TEXT
GENERATED ALWAYS AS (SUBSTR(Fiches_HTML, 1, LENGTH(Fiches_HTML)-1)) STORED;
1.2 Index the Generated Column
Create an index on Fiches_Base
to accelerate lookups in the WHERE
clause and subquery:
CREATE INDEX idx_fa_mop_fiches_base ON FA_MOP(Fiches_Base);
1.3 Index Conduite.Repere
If Repere
is frequently queried, add an index to speed up GLOB
/LIKE
matches:
CREATE INDEX idx_conduite_repere ON Conduite(Repere);
Step 2: Rewrite the Query to Minimize Correlated Execution
2.1 Precompute Patterns and Use a Temporary Table
Extract all unique Fiches_Base
values and store them in a temporary table. Join this with Conduite
to precompute counts:
CREATE TEMP TABLE FichesPatterns AS
SELECT Fiches_Base, Fiches_Base || '[1-9]' AS Pattern
FROM FA_MOP
WHERE Fiches_HTML GLOB '*YP[1-9]';
CREATE TEMP TABLE PatternCounts AS
SELECT fp.Fiches_Base, COUNT(DISTINCT c.Repere) AS count_yp
FROM FichesPatterns fp
LEFT JOIN Conduite c ON c.Repere GLOB fp.Pattern
GROUP BY fp.Fiches_Base;
UPDATE FA_MOP
SET count_yp = (
SELECT pc.count_yp
FROM PatternCounts pc
WHERE pc.Fiches_Base = FA_MOP.Fiches_Base
)
WHERE EXISTS (SELECT 1 FROM PatternCounts pc WHERE pc.Fiches_Base = FA_MOP.Fiches_Base);
2.2 Replace GLOB
with LIKE
Where Possible
If the pattern [1-9]
can be expressed as a single character wildcard (_
), use LIKE
instead:
-- Replace SUBSTR(...) || '[1-9]' with SUBSTR(...) || '_'
WHERE c.Repere LIKE fp.Fiches_Base || '_'
LIKE
is often faster than GLOB
and can utilize indexes if the pattern is anchored (no leading wildcards).
Step 3: Leverage Indexes with Refined Patterns
3.1 Optimize the WHERE
Clause
The original WHERE
clause Fiches_HTML GLOB '*YP[1-9]'
can be rewritten to use LIKE
with a trailing wildcard if the YP[1-9]
is a fixed suffix:
WHERE Fiches_HTML LIKE '%YP_'
However, leading wildcards still prevent index usage. If YP[1-9]
always occurs at the end of Fiches_HTML
, consider adding a reverse index:
ALTER TABLE FA_MOP ADD COLUMN Fiches_Reverse TEXT
GENERATED ALWAYS AS (REVERSE(Fiches_HTML)) STORED;
CREATE INDEX idx_fa_mop_fiches_reverse ON FA_MOP(Fiches_Reverse);
-- Query using reversed pattern
WHERE Fiches_Reverse GLOB '[1-9]PY*'
3.2 Use Full-Text Search for Fixed Suffixes
If YP[1-9]
is a fixed suffix, store it in a separate column and index it:
ALTER TABLE FA_MOP ADD COLUMN Suffix TEXT
GENERATED ALWAYS AS (SUBSTR(Fiches_HTML, -3)) STORED;
CREATE INDEX idx_fa_mop_suffix ON FA_MOP(Suffix);
-- Query using exact suffix matches
WHERE Suffix GLOB 'YP[1-9]'
Step 4: Analyze and Tune Performance
4.1 Execute ANALYZE
Generate statistics for the query planner to make informed decisions:
ANALYZE;
4.2 Review Query Plans
Use EXPLAIN QUERY PLAN
to verify index usage:
EXPLAIN QUERY PLAN
UPDATE FA_MOP ... [rest of the query]
Look for SCAN TABLE Conduite
or SCAN TABLE FA_MOP
in the output, which indicate full-table scans. Ideally, you should see SEARCH TABLE ... USING INDEX
.
4.3 Benchmark Incremental Changes
Test each optimization step independently:
- Add generated columns and indexes.
- Rewrite the query using temporary tables.
- Replace
GLOB
withLIKE
.
Measure execution time after each change to identify the most impactful optimizations.
Step 5: Advanced Optimizations
5.1 Materialized Views
If the data is static, precompute counts and store them in a separate table refreshed periodically:
CREATE TABLE YP_Counts (
Fiches_Base TEXT PRIMARY KEY,
count_yp INTEGER
);
INSERT INTO YP_Counts (Fiches_Base, count_yp)
SELECT Fiches_Base, COUNT(DISTINCT Repere)
FROM Conduite
WHERE Repere GLOB Fiches_Base || '[1-9]'
GROUP BY Fiches_Base;
-- Update FA_MOP using the materialized view
UPDATE FA_MOP
SET count_yp = (SELECT count_yp FROM YP_Counts WHERE Fiches_Base = FA_MOP.Fiches_Base);
5.2 Use Application-Side Caching
Offload pattern matching and counting to the application layer, where string operations may be faster, and cache results.
5.3 Normalize the Data Model
If Repere
values follow a strict pattern (e.g., BaseYPX
), split them into columns:
ALTER TABLE Conduite ADD COLUMN Base TEXT;
ALTER TABLE Conduite ADD COLUMN YP_Number INTEGER;
UPDATE Conduite
SET
Base = SUBSTR(Repere, 1, LENGTH(Repere)-1),
YP_Number = CAST(SUBSTR(Repere, -1) AS INTEGER)
WHERE Repere GLOB '*[1-9]';
CREATE INDEX idx_conduite_base ON Conduite(Base);
Then rewrite the query to join on Base
and filter YP_Number BETWEEN 1 AND 9
.
Final Recommendations
- Prioritize Generated Columns and Indexes: Precompute
Fiches_Base
and index it to eliminate redundantSUBSTR
calls. - Avoid Correlated Subqueries: Use temporary tables or materialized views to batch-process counts.
- Replace
GLOB
withLIKE
: When possible, use simpler patterns that can leverage indexes. - Profile and Iterate: Use
EXPLAIN QUERY PLAN
andANALYZE
to guide optimizations.
By restructuring the schema, precomputing values, and minimizing correlated subqueries, the query’s execution time can be reduced from 30 seconds to sub-second performance.