Index Not Used for JSON_EXTRACT with Subquery in SQLite
Understanding Index Usage Discrepancies in JSON_EXTRACT Queries
The core issue revolves around inconsistent index utilization in SQLite when querying JSON data stored in a text column. Specifically, an index created on the result of JSON_EXTRACT(doc, '$.categoryId')
is used when filtering with an explicit list of values (e.g., IN (1, 2)
), but not when filtering with a subquery that retrieves values from another table (e.g., IN (SELECT id FROM categories)
). This behavior contradicts expectations, as developers often assume equivalent optimization for both forms of IN
clauses.
Key Components of the Schema
categories
Table: Stores category IDs with an integer primary key.todos
Table: Contains adoc
column storing JSON data, including acategoryId
field.- Index Definition:
CREATE INDEX todos_doc_category_id ON todos (JSON_EXTRACT(doc, '$.categoryId'));
This index is designed to accelerate queries filtering on
doc->>'$.categoryId'
.
Observed Behavior
Literal List Query:
SELECT * FROM todos WHERE json_extract(todos.doc, '$.categoryId') IN (1, 2);
- Execution Plan: Uses index
todos_doc_category_id
viaSEARCH
operation. - Efficiency: Direct lookup using indexed values.
- Execution Plan: Uses index
Subquery-Based Filter:
SELECT * FROM todos WHERE json_extract(todos.doc, '$.categoryId') IN (SELECT id FROM categories);
- Execution Plan: Full table scan (
SCAN todos
) with rowid-based search oncategories
. - Inefficiency: Index is ignored despite identical logical intent.
- Execution Plan: Full table scan (
Underlying Mechanics of SQLite’s Query Planner
SQLite’s query planner evaluates multiple strategies for executing a query, including:
- Index Scans: Leveraging B-tree structures for rapid value retrieval.
- Full Scans: Iterating through all rows when index use is deemed inefficient.
- Correlated vs. Uncorrelated Subqueries: Deciding whether to treat subqueries as static lists or dynamic filters.
The discrepancy arises from how SQLite optimizes IN
clauses:
- Literal Lists: Treated as fixed datasets. The planner generates a temporary in-memory index, enabling efficient index-driven lookups.
- Subqueries: May trigger correlation optimizations, transforming
IN
into anEXISTS
-style predicate. This alters execution order and invalidates assumptions about index applicability.
Factors Influencing Index Selection in Subquery vs. Literal List Contexts
1. Optimization of IN
Clauses to EXISTS
Correlated Subqueries
When a subquery is used in an IN
clause, SQLite may rewrite it as:
SELECT * FROM todos
WHERE EXISTS (
SELECT 1 FROM categories
WHERE categories.id = json_extract(todos.doc, '$.categoryId')
);
This transformation changes the execution dynamics:
- Outer Loop: Iterates over
todos
rows. - Inner Loop: Performs a rowid lookup on
categories
for eachtodos
row.
Why the Index Is Ignored:
- The index
todos_doc_category_id
stores precomputedjson_extract(doc, '$.categoryId')
values. - However, the rewritten query requires comparing each
todos
row’s JSON-extracted value against thecategories
table. Sincecategories.id
is indexed (as a primary key), SQLite prioritizes scanningtodos
and usingcategories
’s index for the inner loop.
2. Cost Estimation and Temporary Index Creation
SQLite’s planner uses heuristic cost models to choose execution plans:
Literal List (
IN (1, 2)
):- The list
(1, 2)
is small and known at compile-time. - A temporary index is created for these values, allowing the planner to use the
todos_doc_category_id
index for efficient lookups.
- The list
Subquery (
IN (SELECT ...)
):- The subquery’s result set size is unknown until runtime.
- Creating a temporary index for a potentially large result set is cost-prohibitive.
- The planner defaults to a nested loop join, scanning
todos
and usingcategories
’s primary key for quick existence checks.
3. Impact of JSON Function Usage on Index Eligibility
Indexes on JSON-extracted values have inherent limitations:
- Runtime Evaluation:
JSON_EXTRACT()
is evaluated during query execution, not at index creation. While the index stores precomputed results, the planner may still deem index access less efficient than a full scan if:- The number of matching rows is high.
- The subquery’s result set is large.
- Data Type Mismatches: JSON extracts return TEXT values by default. If
categories.id
is INTEGER, implicit type conversions may prevent index usage.
4. Query Structure and Planner Assumptions
The planner assumes that:
- Small Tables: Full scans are acceptable for small datasets (e.g.,
categories
with few rows). - Index Selectivity: Indexes are only beneficial if they significantly reduce the search space. If the
todos
table is small, a full scan might be faster than index lookups.
Resolving Index Utilization Issues in JSON-Based Queries
Step 1: Diagnose Query Plan Choices
Use EXPLAIN QUERY PLAN
to analyze execution strategies:
EXPLAIN QUERY PLAN
SELECT * FROM todos
WHERE json_extract(todos.doc, '$.categoryId') IN (SELECT id FROM categories);
- Output Interpretation:
USING ROWID SEARCH ON TABLE categories
: Indicates a correlated subquery usingcategories.id
index.SCAN todos
: Full scan oftodos
, implying the JSON index is unused.
Step 2: Disable Subquery Flattening
Force the subquery to materialize its results, enabling temporary index creation:
SELECT * FROM todos
WHERE json_extract(todos.doc, '$.categoryId') IN (SELECT +id FROM categories);
+id
Hack: The unary+
operator prevents the subquery from being flattened into a correlated join. This forces SQLite to:- Materialize
SELECT id FROM categories
into a temporary table. - Build a temporary index on the materialized results.
- Use the
todos_doc_category_id
index for efficient lookups.
- Materialize
Step 3: Use Explicit Index Hints
Override the planner’s decision with INDEXED BY
:
SELECT * FROM todos INDEXED BY todos_doc_category_id
WHERE json_extract(todos.doc, '$.categoryId') IN (SELECT id FROM categories);
- Risks:
- May degrade performance if the index is unsuitable for the data distribution.
- Requires manual validation of index effectiveness.
Step 4: Normalize JSON Data into Relational Columns
Store categoryId
in a separate indexed column:
ALTER TABLE todos ADD COLUMN category_id INTEGER;
CREATE INDEX todos_category_id ON todos(category_id);
Update the column automatically using triggers:
CREATE TRIGGER update_todos_category_id
AFTER INSERT ON todos
BEGIN
UPDATE todos
SET category_id = json_extract(doc, '$.categoryId')
WHERE id = NEW.id;
END;
- Benefits:
- Eliminates JSON function overhead.
- Enables direct index usage without planner ambiguities.
Step 5: Use Materialized Views for JSON Data
Create a view that materializes JSON extracts:
CREATE VIEW todos_with_category AS
SELECT id, json_extract(doc, '$.categoryId') AS category_id, doc
FROM todos;
Query the materialized view:
SELECT * FROM todos_with_category
WHERE category_id IN (SELECT id FROM categories);
- Trade-offs:
- Increased storage for materialized data.
- Requires refresh mechanisms if
doc
changes.
Step 6: Adjust Query Structure for Planner Friendliness
Rewrite the query to use a JOIN
instead of IN
:
SELECT todos.*
FROM todos
JOIN categories
ON categories.id = json_extract(todos.doc, '$.categoryId');
- Planner Behavior:
- Treats
categories
as the outer loop. - Uses
categories.id
index for lookups intotodos
.
- Treats
Step 7: Update Statistics for Cost-Based Optimizations
Run ANALYZE
to refresh table statistics:
ANALYZE;
- Impact:
- Helps the planner make informed decisions about index vs. scan efficiency.
- Particularly useful after significant data changes.
Step 8: Consider JSON1 Extension Alternatives
Use json_tree()
for structured JSON queries:
SELECT todos.*
FROM todos, json_tree(todos.doc)
WHERE json_tree.key = 'categoryId'
AND json_tree.value IN (SELECT id FROM categories);
- Advantages:
- Leverages JSON-specific optimizations.
- May bypass planner limitations with scalar extracts.
Step 9: Evaluate Schema Design Trade-Offs
- Hybrid JSON/Relational Models:
- Store frequently queried fields in relational columns.
- Reserve JSON for unstructured data.
- Full-Text Search Alternatives:
- Use
fts5
virtual tables for JSON content requiring complex queries.
- Use
Step 10: Monitor Planner Decisions with Runtime Flags
Enable SQLite’s debugging features to log planner decisions:
PRAGMA compile_options;
-- Ensure 'ENABLE_EXPLAIN_COMMENTS' is enabled
EXPLAIN QUERY PLAN
SELECT ...;
- Output Analysis:
- Look for comments indicating why indexes were rejected.
- Example:
/* NOT INDEXED: json_extract(...) */
.
By systematically addressing the interplay between JSON function usage, index eligibility, and query planner behavior, developers can enforce index utilization in subquery-based IN
clauses while maintaining schema flexibility. The choice of solution depends on factors like data volatility, query frequency, and performance requirements.