Index Not Used for JSON_EXTRACT with Subquery in SQLite
Understanding Index Usage Discrepancies in JSON_EXTRACT Queries
The core issue revolves around inconsistent index utilization in SQLite when querying JSON data stored in a text column. Specifically, an index created on the result of JSON_EXTRACT(doc, '$.categoryId') is used when filtering with an explicit list of values (e.g., IN (1, 2)), but not when filtering with a subquery that retrieves values from another table (e.g., IN (SELECT id FROM categories)). This behavior contradicts expectations, as developers often assume equivalent optimization for both forms of IN clauses.
Key Components of the Schema
categoriesTable: Stores category IDs with an integer primary key.todosTable: Contains adoccolumn storing JSON data, including acategoryIdfield.- Index Definition:
CREATE INDEX todos_doc_category_id ON todos (JSON_EXTRACT(doc, '$.categoryId'));This index is designed to accelerate queries filtering on
doc->>'$.categoryId'.
Observed Behavior
-
Literal List Query:
SELECT * FROM todos WHERE json_extract(todos.doc, '$.categoryId') IN (1, 2);- Execution Plan: Uses index
todos_doc_category_idviaSEARCHoperation. - Efficiency: Direct lookup using indexed values.
- Execution Plan: Uses index
-
Subquery-Based Filter:
SELECT * FROM todos WHERE json_extract(todos.doc, '$.categoryId') IN (SELECT id FROM categories);- Execution Plan: Full table scan (
SCAN todos) with rowid-based search oncategories. - Inefficiency: Index is ignored despite identical logical intent.
- Execution Plan: Full table scan (
Underlying Mechanics of SQLite’s Query Planner
SQLite’s query planner evaluates multiple strategies for executing a query, including:
- Index Scans: Leveraging B-tree structures for rapid value retrieval.
- Full Scans: Iterating through all rows when index use is deemed inefficient.
- Correlated vs. Uncorrelated Subqueries: Deciding whether to treat subqueries as static lists or dynamic filters.
The discrepancy arises from how SQLite optimizes IN clauses:
- Literal Lists: Treated as fixed datasets. The planner generates a temporary in-memory index, enabling efficient index-driven lookups.
- Subqueries: May trigger correlation optimizations, transforming
INinto anEXISTS-style predicate. This alters execution order and invalidates assumptions about index applicability.
Factors Influencing Index Selection in Subquery vs. Literal List Contexts
1. Optimization of IN Clauses to EXISTS Correlated Subqueries
When a subquery is used in an IN clause, SQLite may rewrite it as:
SELECT * FROM todos
WHERE EXISTS (
SELECT 1 FROM categories
WHERE categories.id = json_extract(todos.doc, '$.categoryId')
);
This transformation changes the execution dynamics:
- Outer Loop: Iterates over
todosrows. - Inner Loop: Performs a rowid lookup on
categoriesfor eachtodosrow.
Why the Index Is Ignored:
- The index
todos_doc_category_idstores precomputedjson_extract(doc, '$.categoryId')values. - However, the rewritten query requires comparing each
todosrow’s JSON-extracted value against thecategoriestable. Sincecategories.idis indexed (as a primary key), SQLite prioritizes scanningtodosand usingcategories’s index for the inner loop.
2. Cost Estimation and Temporary Index Creation
SQLite’s planner uses heuristic cost models to choose execution plans:
-
Literal List (
IN (1, 2)):- The list
(1, 2)is small and known at compile-time. - A temporary index is created for these values, allowing the planner to use the
todos_doc_category_idindex for efficient lookups.
- The list
-
Subquery (
IN (SELECT ...)):- The subquery’s result set size is unknown until runtime.
- Creating a temporary index for a potentially large result set is cost-prohibitive.
- The planner defaults to a nested loop join, scanning
todosand usingcategories’s primary key for quick existence checks.
3. Impact of JSON Function Usage on Index Eligibility
Indexes on JSON-extracted values have inherent limitations:
- Runtime Evaluation:
JSON_EXTRACT()is evaluated during query execution, not at index creation. While the index stores precomputed results, the planner may still deem index access less efficient than a full scan if:- The number of matching rows is high.
- The subquery’s result set is large.
- Data Type Mismatches: JSON extracts return TEXT values by default. If
categories.idis INTEGER, implicit type conversions may prevent index usage.
4. Query Structure and Planner Assumptions
The planner assumes that:
- Small Tables: Full scans are acceptable for small datasets (e.g.,
categorieswith few rows). - Index Selectivity: Indexes are only beneficial if they significantly reduce the search space. If the
todostable is small, a full scan might be faster than index lookups.
Resolving Index Utilization Issues in JSON-Based Queries
Step 1: Diagnose Query Plan Choices
Use EXPLAIN QUERY PLAN to analyze execution strategies:
EXPLAIN QUERY PLAN
SELECT * FROM todos
WHERE json_extract(todos.doc, '$.categoryId') IN (SELECT id FROM categories);
- Output Interpretation:
USING ROWID SEARCH ON TABLE categories: Indicates a correlated subquery usingcategories.idindex.SCAN todos: Full scan oftodos, implying the JSON index is unused.
Step 2: Disable Subquery Flattening
Force the subquery to materialize its results, enabling temporary index creation:
SELECT * FROM todos
WHERE json_extract(todos.doc, '$.categoryId') IN (SELECT +id FROM categories);
+idHack: The unary+operator prevents the subquery from being flattened into a correlated join. This forces SQLite to:- Materialize
SELECT id FROM categoriesinto a temporary table. - Build a temporary index on the materialized results.
- Use the
todos_doc_category_idindex for efficient lookups.
- Materialize
Step 3: Use Explicit Index Hints
Override the planner’s decision with INDEXED BY:
SELECT * FROM todos INDEXED BY todos_doc_category_id
WHERE json_extract(todos.doc, '$.categoryId') IN (SELECT id FROM categories);
- Risks:
- May degrade performance if the index is unsuitable for the data distribution.
- Requires manual validation of index effectiveness.
Step 4: Normalize JSON Data into Relational Columns
Store categoryId in a separate indexed column:
ALTER TABLE todos ADD COLUMN category_id INTEGER;
CREATE INDEX todos_category_id ON todos(category_id);
Update the column automatically using triggers:
CREATE TRIGGER update_todos_category_id
AFTER INSERT ON todos
BEGIN
UPDATE todos
SET category_id = json_extract(doc, '$.categoryId')
WHERE id = NEW.id;
END;
- Benefits:
- Eliminates JSON function overhead.
- Enables direct index usage without planner ambiguities.
Step 5: Use Materialized Views for JSON Data
Create a view that materializes JSON extracts:
CREATE VIEW todos_with_category AS
SELECT id, json_extract(doc, '$.categoryId') AS category_id, doc
FROM todos;
Query the materialized view:
SELECT * FROM todos_with_category
WHERE category_id IN (SELECT id FROM categories);
- Trade-offs:
- Increased storage for materialized data.
- Requires refresh mechanisms if
docchanges.
Step 6: Adjust Query Structure for Planner Friendliness
Rewrite the query to use a JOIN instead of IN:
SELECT todos.*
FROM todos
JOIN categories
ON categories.id = json_extract(todos.doc, '$.categoryId');
- Planner Behavior:
- Treats
categoriesas the outer loop. - Uses
categories.idindex for lookups intotodos.
- Treats
Step 7: Update Statistics for Cost-Based Optimizations
Run ANALYZE to refresh table statistics:
ANALYZE;
- Impact:
- Helps the planner make informed decisions about index vs. scan efficiency.
- Particularly useful after significant data changes.
Step 8: Consider JSON1 Extension Alternatives
Use json_tree() for structured JSON queries:
SELECT todos.*
FROM todos, json_tree(todos.doc)
WHERE json_tree.key = 'categoryId'
AND json_tree.value IN (SELECT id FROM categories);
- Advantages:
- Leverages JSON-specific optimizations.
- May bypass planner limitations with scalar extracts.
Step 9: Evaluate Schema Design Trade-Offs
- Hybrid JSON/Relational Models:
- Store frequently queried fields in relational columns.
- Reserve JSON for unstructured data.
- Full-Text Search Alternatives:
- Use
fts5virtual tables for JSON content requiring complex queries.
- Use
Step 10: Monitor Planner Decisions with Runtime Flags
Enable SQLite’s debugging features to log planner decisions:
PRAGMA compile_options;
-- Ensure 'ENABLE_EXPLAIN_COMMENTS' is enabled
EXPLAIN QUERY PLAN
SELECT ...;
- Output Analysis:
- Look for comments indicating why indexes were rejected.
- Example:
/* NOT INDEXED: json_extract(...) */.
By systematically addressing the interplay between JSON function usage, index eligibility, and query planner behavior, developers can enforce index utilization in subquery-based IN clauses while maintaining schema flexibility. The choice of solution depends on factors like data volatility, query frequency, and performance requirements.