Sudden Query Performance Drop When Exceeding Specific IN Clause Size in SQLite


Threshold-Based Query Plan Degradation in Virtual Table JOIN Operations


Unexpected Execution Plan Shifts with Large IN Lists and Virtual Tables

The core issue revolves around a dramatic performance degradation in SQLite queries when the number of elements in an IN clause crosses specific thresholds (e.g., 26623 vs. 26624 or 3071 vs. 3072 IDs). This manifests as a 100x slowdown in query execution time. The problem is exacerbated when interacting with custom virtual tables like search_params, which parse HTTP query strings into name/value pairs. Key observations include:

  1. Threshold-Specific Degradation: Query execution time spikes abruptly at specific IN list sizes (e.g., 26624 IDs taking 50s vs. 26623 IDs at 500ms).
  2. Virtual Table Interaction: The slowdown is tied to how SQLite’s query planner (NGQP) interacts with virtual tables like search_params, particularly in JOIN operations.
  3. Workaround Efficacy: Rewriting the IN list using json_each('[1,2,3,...]') avoids the slowdown, suggesting a fundamental difference in how SQLite processes literal lists versus subqueries.

Query Planner Misestimations and Temporary Table Handling

The root cause lies in SQLite’s cost-based query planner (NGQP) and its handling of IN lists during query optimization. Specific factors include:

  1. Temporary Table Generation for IN Lists:

    • SQLite converts WHERE id IN (1,2,3,...) into an ephemeral table. For small lists, this table is optimized as a VALUES clause or indexed lookup. For large lists, SQLite may:
      • Fail to materialize an efficient index.
      • Misjudge the cost of scanning the list versus alternative access paths.
    • The threshold (e.g., 3072 elements) corresponds to internal heuristics where SQLite switches from using an index-backed temporary table to a full scan.
  2. Virtual Table Cost Misreporting:

    • Custom virtual tables like search_params must implement the xBestIndex method to provide row count and cost estimates. Inaccurate estimates here can:
      • Skew the planner’s choice of JOIN order.
      • Force nested loop joins instead of hash joins or batch scans.
    • Example: If search_params underestimates its output rows, the planner may prioritize it as the outer loop, leading to Cartesian explosions when combined with a large IN list.
  3. Index Utilization Failures:

    • The flows.id column may lack an index, forcing full table scans when the IN list exceeds a size where SQLite deems index lookups impractical.
    • The presence of a virtual table in the FROM clause can disrupt index selection logic, especially if the virtual table’s metadata (e.g., row uniqueness) is misreported.
  4. Query Plan Stability:

    • Small changes in input size (e.g., adding one ID to the IN list) can trigger a "phase change" in the query planner’s cost calculations, leading to entirely different execution plans.

Diagnostic Procedures and Query Plan Correction Strategies

Step 1: Analyze Query Plans with EXPLAIN and EXPLAIN QUERY PLAN
Compare the output of EXPLAIN QUERY PLAN for the slow and fast variants:

-- Original slow query
EXPLAIN QUERY PLAN
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE flows.id IN (1,2,3,...,26624)
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC;

-- Fast workaround using json_each
EXPLAIN QUERY PLAN
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE flows.id IN (SELECT value FROM json_each('[1,2,3,...,26624]'))
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC;

Key Differences to Look For:

  • Join Order: Does the slow query process the IN list first (flows table) or the virtual table (search_params)? The latter can cause O(n²) scaling.
  • Temporary Table Indexing: The fast query using json_each likely triggers an AUTOINDEX on the ephemeral table, enabling efficient lookups. The slow query may scan the list as an unindexed temp table.
  • Virtual Table Filter Pushdown: Ensure search_params constraints are applied early. Misimplemented xFilter methods can delay predicate evaluation.

Step 2: Audit Virtual Table xBestIndex Implementation
The search_params virtual table must accurately report:

  • Estimated Rows: Overrides via sqlite3_index_info::nRow to match realistic output sizes (e.g., average URL parameters per request).
  • Cost Estimates: Set sqlite3_index_info::estimatedCost to reflect processing effort relative to native tables. Underestimating costs can trick the planner into favoring nested loops over batch scans.
  • Constraint Usage: Declare which virtual table columns are constrained by the query (e.g., search parameter parsing).

Step 3: Force Join Order with CROSS JOIN
Rewrite the query to materialize the IN list as a Common Table Expression (CTE) and force its processing order:

WITH ids(id) AS (SELECT value FROM json_each('[1,2,3,...,26624]'))
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM ids
CROSS JOIN flows ON flows.id = ids.id
JOIN view_requests AS req ON flows.request_id = req.id
JOIN search_params(search) AS params
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC;

This ensures ids is processed first, leveraging auto-indexing and avoiding Cartesian products with the virtual table.

Step 4: Indexing and Schema Adjustments

  • Add an index on flows.id (if missing) to support rapid lookups:
    CREATE INDEX idx_flows_id ON flows(id);
    
  • If flows.request_id lacks an index, create one to accelerate JOINs with view_requests:
    CREATE INDEX idx_flows_request_id ON flows(request_id);
    

Step 5: Recompile SQLite with Diagnostic Features
Enable the WHERETRACE and SQLITE_ENABLE_STMT_SCANSTATUS compile-time options to log query planner decisions:

# Rebuild better-sqlite3 with debugging flags
npm rebuild better-sqlite3 --build-from-source --sqlite_enable_stmt_scanstatus=1 --sqlite_enable_wheretrace=1

Use .wheretrace in the SQLite shell to visualize cost calculations and plan selection.

Step 6: Parameterize IN Lists via External Binding
Avoid literal IN lists entirely by binding parameters dynamically. For example, in Node.js with better-sqlite3:

const ids = [1, 2, 3, ..., 26624];
const placeholders = ids.map(() => '?').join(',');
const stmt = db.prepare(`
  SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
  FROM view_requests AS req, search_params(search) AS params
  JOIN flows ON flows.request_id = req.id
  WHERE flows.id IN (${placeholders})
  GROUP BY params.name
  ORDER BY json_array_length("values") DESC, params.name ASC
`);
stmt.all(...ids);

This bypasses literal list parsing and leverages SQLite’s native parameter handling, which is more robust for large datasets.

Step 7: Profile Virtual Table Performance
Isolate the virtual table’s impact by benchmarking it separately:

-- Test search_params parsing speed
EXPLAIN QUERY PLAN SELECT * FROM search_params('?foo=bar&baz=qux');
-- Compare with a native table of similar structure
CREATE TABLE mock_search_params(name TEXT, value TEXT);
EXPLAIN QUERY PLAN SELECT * FROM mock_search_params;

If the virtual table is significantly slower, optimize its xFilter and xNext methods to minimize computational overhead.


Final Recommendations

  • Adopt json_each Workaround: Use it universally for IN lists to ensure consistent auto-indexing.
  • Refine Virtual Table Metadata: Ensure xBestIndex returns accurate row counts and costs.
  • Monitor Plan Stability: Use EXPLAIN QUERY PLAN during testing to catch regressions early.
  • Upgrade SQLite: Newer versions (3.37.0+) include optimizer improvements for large IN lists.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *