Index Not Used for JSON_EXTRACT with Subquery in SQLite


Understanding Index Usage Discrepancies in JSON_EXTRACT Queries

The core issue revolves around inconsistent index utilization in SQLite when querying JSON data stored in a text column. Specifically, an index created on the result of JSON_EXTRACT(doc, '$.categoryId') is used when filtering with an explicit list of values (e.g., IN (1, 2)), but not when filtering with a subquery that retrieves values from another table (e.g., IN (SELECT id FROM categories)). This behavior contradicts expectations, as developers often assume equivalent optimization for both forms of IN clauses.

Key Components of the Schema

  • categories Table: Stores category IDs with an integer primary key.
  • todos Table: Contains a doc column storing JSON data, including a categoryId field.
  • Index Definition:
    CREATE INDEX todos_doc_category_id ON todos (JSON_EXTRACT(doc, '$.categoryId'));
    

    This index is designed to accelerate queries filtering on doc->>'$.categoryId'.

Observed Behavior

  1. Literal List Query:

    SELECT * FROM todos 
    WHERE json_extract(todos.doc, '$.categoryId') IN (1, 2);
    
    • Execution Plan: Uses index todos_doc_category_id via SEARCH operation.
    • Efficiency: Direct lookup using indexed values.
  2. Subquery-Based Filter:

    SELECT * FROM todos 
    WHERE json_extract(todos.doc, '$.categoryId') IN (SELECT id FROM categories);
    
    • Execution Plan: Full table scan (SCAN todos) with rowid-based search on categories.
    • Inefficiency: Index is ignored despite identical logical intent.

Underlying Mechanics of SQLite’s Query Planner

SQLite’s query planner evaluates multiple strategies for executing a query, including:

  • Index Scans: Leveraging B-tree structures for rapid value retrieval.
  • Full Scans: Iterating through all rows when index use is deemed inefficient.
  • Correlated vs. Uncorrelated Subqueries: Deciding whether to treat subqueries as static lists or dynamic filters.

The discrepancy arises from how SQLite optimizes IN clauses:

  • Literal Lists: Treated as fixed datasets. The planner generates a temporary in-memory index, enabling efficient index-driven lookups.
  • Subqueries: May trigger correlation optimizations, transforming IN into an EXISTS-style predicate. This alters execution order and invalidates assumptions about index applicability.

Factors Influencing Index Selection in Subquery vs. Literal List Contexts

1. Optimization of IN Clauses to EXISTS Correlated Subqueries

When a subquery is used in an IN clause, SQLite may rewrite it as:

SELECT * FROM todos 
WHERE EXISTS (
  SELECT 1 FROM categories 
  WHERE categories.id = json_extract(todos.doc, '$.categoryId')
);

This transformation changes the execution dynamics:

  • Outer Loop: Iterates over todos rows.
  • Inner Loop: Performs a rowid lookup on categories for each todos row.

Why the Index Is Ignored:

  • The index todos_doc_category_id stores precomputed json_extract(doc, '$.categoryId') values.
  • However, the rewritten query requires comparing each todos row’s JSON-extracted value against the categories table. Since categories.id is indexed (as a primary key), SQLite prioritizes scanning todos and using categories’s index for the inner loop.

2. Cost Estimation and Temporary Index Creation

SQLite’s planner uses heuristic cost models to choose execution plans:

  • Literal List (IN (1, 2)):

    • The list (1, 2) is small and known at compile-time.
    • A temporary index is created for these values, allowing the planner to use the todos_doc_category_id index for efficient lookups.
  • Subquery (IN (SELECT ...)):

    • The subquery’s result set size is unknown until runtime.
    • Creating a temporary index for a potentially large result set is cost-prohibitive.
    • The planner defaults to a nested loop join, scanning todos and using categories’s primary key for quick existence checks.

3. Impact of JSON Function Usage on Index Eligibility

Indexes on JSON-extracted values have inherent limitations:

  • Runtime Evaluation: JSON_EXTRACT() is evaluated during query execution, not at index creation. While the index stores precomputed results, the planner may still deem index access less efficient than a full scan if:
    • The number of matching rows is high.
    • The subquery’s result set is large.
  • Data Type Mismatches: JSON extracts return TEXT values by default. If categories.id is INTEGER, implicit type conversions may prevent index usage.

4. Query Structure and Planner Assumptions

The planner assumes that:

  • Small Tables: Full scans are acceptable for small datasets (e.g., categories with few rows).
  • Index Selectivity: Indexes are only beneficial if they significantly reduce the search space. If the todos table is small, a full scan might be faster than index lookups.

Resolving Index Utilization Issues in JSON-Based Queries

Step 1: Diagnose Query Plan Choices

Use EXPLAIN QUERY PLAN to analyze execution strategies:

EXPLAIN QUERY PLAN
SELECT * FROM todos 
WHERE json_extract(todos.doc, '$.categoryId') IN (SELECT id FROM categories);
  • Output Interpretation:
    • USING ROWID SEARCH ON TABLE categories: Indicates a correlated subquery using categories.id index.
    • SCAN todos: Full scan of todos, implying the JSON index is unused.

Step 2: Disable Subquery Flattening

Force the subquery to materialize its results, enabling temporary index creation:

SELECT * FROM todos 
WHERE json_extract(todos.doc, '$.categoryId') IN (SELECT +id FROM categories);
  • +id Hack: The unary + operator prevents the subquery from being flattened into a correlated join. This forces SQLite to:
    1. Materialize SELECT id FROM categories into a temporary table.
    2. Build a temporary index on the materialized results.
    3. Use the todos_doc_category_id index for efficient lookups.

Step 3: Use Explicit Index Hints

Override the planner’s decision with INDEXED BY:

SELECT * FROM todos INDEXED BY todos_doc_category_id
WHERE json_extract(todos.doc, '$.categoryId') IN (SELECT id FROM categories);
  • Risks:
    • May degrade performance if the index is unsuitable for the data distribution.
    • Requires manual validation of index effectiveness.

Step 4: Normalize JSON Data into Relational Columns

Store categoryId in a separate indexed column:

ALTER TABLE todos ADD COLUMN category_id INTEGER;
CREATE INDEX todos_category_id ON todos(category_id);

Update the column automatically using triggers:

CREATE TRIGGER update_todos_category_id 
AFTER INSERT ON todos 
BEGIN
  UPDATE todos 
  SET category_id = json_extract(doc, '$.categoryId') 
  WHERE id = NEW.id;
END;
  • Benefits:
    • Eliminates JSON function overhead.
    • Enables direct index usage without planner ambiguities.

Step 5: Use Materialized Views for JSON Data

Create a view that materializes JSON extracts:

CREATE VIEW todos_with_category AS
SELECT id, json_extract(doc, '$.categoryId') AS category_id, doc
FROM todos;

Query the materialized view:

SELECT * FROM todos_with_category 
WHERE category_id IN (SELECT id FROM categories);
  • Trade-offs:
    • Increased storage for materialized data.
    • Requires refresh mechanisms if doc changes.

Step 6: Adjust Query Structure for Planner Friendliness

Rewrite the query to use a JOIN instead of IN:

SELECT todos.* 
FROM todos 
JOIN categories 
  ON categories.id = json_extract(todos.doc, '$.categoryId');
  • Planner Behavior:
    • Treats categories as the outer loop.
    • Uses categories.id index for lookups into todos.

Step 7: Update Statistics for Cost-Based Optimizations

Run ANALYZE to refresh table statistics:

ANALYZE;
  • Impact:
    • Helps the planner make informed decisions about index vs. scan efficiency.
    • Particularly useful after significant data changes.

Step 8: Consider JSON1 Extension Alternatives

Use json_tree() for structured JSON queries:

SELECT todos.* 
FROM todos, json_tree(todos.doc) 
WHERE json_tree.key = 'categoryId' 
  AND json_tree.value IN (SELECT id FROM categories);
  • Advantages:
    • Leverages JSON-specific optimizations.
    • May bypass planner limitations with scalar extracts.

Step 9: Evaluate Schema Design Trade-Offs

  • Hybrid JSON/Relational Models:
    • Store frequently queried fields in relational columns.
    • Reserve JSON for unstructured data.
  • Full-Text Search Alternatives:
    • Use fts5 virtual tables for JSON content requiring complex queries.

Step 10: Monitor Planner Decisions with Runtime Flags

Enable SQLite’s debugging features to log planner decisions:

PRAGMA compile_options;
-- Ensure 'ENABLE_EXPLAIN_COMMENTS' is enabled
EXPLAIN QUERY PLAN 
SELECT ...;
  • Output Analysis:
    • Look for comments indicating why indexes were rejected.
    • Example: /* NOT INDEXED: json_extract(...) */.

By systematically addressing the interplay between JSON function usage, index eligibility, and query planner behavior, developers can enforce index utilization in subquery-based IN clauses while maintaining schema flexibility. The choice of solution depends on factors like data volatility, query frequency, and performance requirements.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *