ANALYZE Misreports Statistics for Indexes Using json_extract in SQLite
Issue Overview: Misleading Index Statistics from ANALYZE with JSON-Based Indexes
When working with SQLite’s ANALYZE command to gather statistics for query optimization, users may encounter unexpected results when indexes involve JSON extraction functions like json_extract(). A common symptom is observing disproportionately low values in the sqlite_stat1 table’s stat column for indexes built on JSON fields. This discrepancy can lead to suboptimal query plans, as the query planner relies on these statistics to estimate the cost of different access paths.
Consider a scenario where a table events contains a JSON column storing hierarchical relationships via a parentId field. An index is created on json_extract(json, '$.parentId'), and ANALYZE is executed to populate sqlite_stat1. The stat column for this index might report an average row count per distinct value (the second integer in the stat string) that seems inconsistent with the actual distribution observed in the data. For example, a value of 11 might be reported even though a direct query shows that the most frequent parentId (including NULL) appears 38,633 times. This mismatch stems from nuances in how ANALYZE calculates statistics for indexes involving JSON functions and how NULL values are handled in such indexes.
Possible Causes: JSON Extraction, NULL Handling, and Statistics Calculation
1. Misinterpretation of sqlite_stat1 Statistics
The stat column in sqlite_stat1 contains two critical integers for single-column indexes:
- The first integer represents the number of entries in the index, which for an ordinary (non-partial) index equals the total number of rows in the table.
- The second integer represents the average number of rows per distinct value in the indexed column, calculated approximately as total_rows / number_of_distinct_values.
A common misunderstanding is assuming the second integer reflects the maximum number of rows for a single value (e.g., the NULL group with 38,633 rows). Instead, it is the average across all distinct values, including NULL. If the indexed column has many distinct values, the average will be low, even if one value (like NULL) dominates.
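A small, self-contained sketch (hypothetical table and column names, using Python's standard-library sqlite3 module) makes the distinction concrete: even when one value dominates, the second integer stays close to the overall average.

```python
# Illustrative sketch: the second integer in sqlite_stat1.stat is an
# average over all distinct values, not the count of the largest group.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
# 10 rows, 3 distinct values; the value 1 dominates with 8 rows.
conn.executemany("INSERT INTO t VALUES (?)", [(1,)] * 8 + [(2,), (3,)])
conn.execute("CREATE INDEX t_v ON t(v)")
conn.execute("ANALYZE")

stat = conn.execute(
    "SELECT stat FROM sqlite_stat1 WHERE idx = 't_v'"
).fetchone()[0]
# First number: 10 total rows. Second: roughly 10 / 3,
# well below the 8 rows of the dominant value.
print(stat)
```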
2. Impact of NULL Values on Index Statistics
When an index includes NULL values (as it does by default in SQLite), ANALYZE treats NULL as a distinct value. For example, if 38,633 rows have NULL for json_extract(json, '$.parentId'), and the remaining 24,056 rows are spread across 2,186 distinct non-NULL values, the total distinct values are 2,186 + 1 = 2,187. The average rows per value becomes 62,689 / 2,187 ≈ 28.7, not 11. This suggests a deeper issue: the index might not be capturing the true distribution of values due to how JSON extraction interacts with statistics collection.
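The arithmetic can be checked directly; the figures below are taken from the example in this section. Notably, the 11 that was reported matches the average over only the non-NULL rows, which is what motivates the partial-index approach in Step 2.

```python
# Reproduces the arithmetic from the example figures above.
null_rows = 38_633          # rows where $.parentId is NULL
non_null_rows = 24_056      # rows with a non-NULL parentId
distinct_non_null = 2_186   # distinct non-NULL parentId values

total_rows = null_rows + non_null_rows      # 62,689
distinct_with_null = distinct_non_null + 1  # NULL counted as one distinct value

print(round(total_rows / distinct_with_null, 1))  # expected average over the whole index
print(round(non_null_rows / distinct_non_null))   # the 11 that ANALYZE actually reported
```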
3. Data Type Mismatch Between Index and Queries
The type of the value json_extract() returns depends on the JSON type it finds: a JSON number is returned as an SQL INTEGER or REAL, but a JSON string is returned as TEXT. For instance, {"parentId": "11384"} is extracted as the string '11384', not the integer 11384. If some rows store the value as a JSON string while queries compare against an integer (e.g., WHERE json_extract(json, '$.parentId') = 11384), the TEXT values never match the integer predicate, the index on the mixed representation is used less effectively, and statistics might not reflect the intended grouping.
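This behavior is easy to verify from any SQLite shell or driver. The following sketch uses Python's standard-library sqlite3 module (it needs an SQLite build with JSON support, which ships by default since 3.38) to inspect the returned types:

```python
# Checks what SQL type json_extract() actually returns: a JSON number
# comes back as an SQL INTEGER; a JSON string comes back as TEXT.
import sqlite3

conn = sqlite3.connect(":memory:")

number = conn.execute(
    "SELECT typeof(json_extract('{\"parentId\": 11384}', '$.parentId'))"
).fetchone()[0]
string = conn.execute(
    "SELECT typeof(json_extract('{\"parentId\": \"11384\"}', '$.parentId'))"
).fetchone()[0]
print(number, string)  # integer text
```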
Troubleshooting Steps, Solutions & Fixes: Accurate Statistics for JSON-Based Indexes
Step 1: Validate the Interpretation of sqlite_stat1 Data
Before making schema changes, confirm that the reported statistics align with the actual data distribution. Execute the following query to calculate the average rows per distinct value manually:
SELECT
    total_rows,
    distinct_values,
    total_rows * 1.0 / distinct_values AS calculated_average
FROM (
    SELECT
        COUNT(*) AS total_rows,
        -- COUNT(DISTINCT ...) ignores NULL, so add 1 if any NULLs exist
        COUNT(DISTINCT json_extract(json, '$.parentId'))
            + (COUNT(*) > COUNT(json_extract(json, '$.parentId'))) AS distinct_values
    FROM events
);
Compare the calculated_average with the second integer in sqlite_stat1.stat. If they match, the issue is not with ANALYZE but with how the statistics are being interpreted. If they do not match, proceed to the next steps.
Step 2: Address NULL Values with Partial Indexes
To exclude NULL values from the index (and its statistics), create a partial index that filters out rows where json_extract(json, '$.parentId') IS NULL:
CREATE INDEX events_parentId_not_null ON events (
json_extract(json, '$.parentId')
) WHERE json_extract(json, '$.parentId') IS NOT NULL;
Re-run ANALYZE and check sqlite_stat1 for the new index. The average rows per value should now reflect only non-NULL entries. For example, if there are 24,056 non-NULL rows and 2,186 distinct values, the average becomes 24,056 / 2,186 ≈ 11, matching the originally reported statistic. This makes the index more selective for non-NULL queries.
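The effect can be observed end to end with a small sketch (illustrative data and index name; partial indexes and indexes on expressions require SQLite 3.8.0+ and 3.9.0+ respectively):

```python
# A partial index excludes NULLs from both the index and its statistics.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (json TEXT)")
# 10 rows total: 6 with a NULL parentId, 4 non-NULL across 2 distinct values.
rows = ([('{"parentId": null}',)] * 6
        + [('{"parentId": 1}',)] * 2
        + [('{"parentId": 2}',)] * 2)
conn.executemany("INSERT INTO events VALUES (?)", rows)
conn.execute("""
    CREATE INDEX events_pid_not_null ON events (json_extract(json, '$.parentId'))
    WHERE json_extract(json, '$.parentId') IS NOT NULL
""")
conn.execute("ANALYZE")

stat = conn.execute(
    "SELECT stat FROM sqlite_stat1 WHERE idx = 'events_pid_not_null'"
).fetchone()[0]
print(stat)  # "4 2": only the 4 non-NULL rows, averaging 2 per distinct value
```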
Step 3: Use Generated Columns for Type Consistency
To resolve data type mismatches between JSON-extracted TEXT values and integer-based queries, define a generated column with INTEGER affinity. Note that SQLite's ALTER TABLE can only add VIRTUAL generated columns; adding a STORED column requires recreating the table:
ALTER TABLE events ADD COLUMN parentId INTEGER
    GENERATED ALWAYS AS (json_extract(json, '$.parentId')) VIRTUAL;
CREATE INDEX events_parentId_int ON events(parentId);
The column's INTEGER affinity coerces numeric text such as '11384' to the integer 11384, ensuring that the index and queries use the same data type. ANALYZE will now collect statistics based on integer values, which are more likely to align with query predicates. A VIRTUAL column adds no storage to the table itself; the index materializes its values, so indexed lookups stay fast.
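As a quick check, the following sketch (illustrative names; generated columns require SQLite 3.31+, and ALTER TABLE can only add the VIRTUAL kind) verifies that INTEGER affinity coerces a JSON string value to an integer:

```python
# Adds a VIRTUAL generated column and confirms the INTEGER affinity
# converts the extracted text '11384' into the integer 11384.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (json TEXT)")
conn.execute("INSERT INTO events VALUES ('{\"parentId\": \"11384\"}')")
conn.execute("""
    ALTER TABLE events ADD COLUMN parentId INTEGER
        GENERATED ALWAYS AS (json_extract(json, '$.parentId')) VIRTUAL
""")
conn.execute("CREATE INDEX events_parentId_int ON events(parentId)")

row = conn.execute("SELECT parentId, typeof(parentId) FROM events").fetchone()
print(row)  # (11384, 'integer')
```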
Step 4: Verify JSON Structure and Extraction Logic
Ensure that all JSON entries for parentId use a consistent data type (e.g., always a number, never a string). Inconsistent typing (e.g., "parentId": "11384" vs. "parentId": 11384) leads to multiple distinct values in the index ('11384' vs. 11384), inflating the number of distinct values and lowering the average rows per value. Use a JSON schema validator or a trigger to enforce consistency.
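One way to enforce this inside the database, sketched below with a hypothetical trigger name, is a BEFORE INSERT trigger that rejects rows whose $.parentId is neither a JSON number nor JSON null. A matching BEFORE UPDATE trigger would be needed for full coverage, and a row missing the key passes because json_type() then returns SQL NULL.

```python
# Rejects inserts where $.parentId is not a JSON integer or JSON null.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (json TEXT)")
conn.execute("""
    CREATE TRIGGER events_parentId_type_check
    BEFORE INSERT ON events
    WHEN json_type(NEW.json, '$.parentId') NOT IN ('integer', 'null')
    BEGIN
        SELECT RAISE(ABORT, 'parentId must be a JSON number or null');
    END
""")

conn.execute("INSERT INTO events VALUES ('{\"parentId\": 11384}')")  # accepted
try:
    conn.execute("INSERT INTO events VALUES ('{\"parentId\": \"11384\"}')")
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print(exc)  # parentId must be a JSON number or null
```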
Step 5: Rebuild Statistics with Full Scan
SQLite's ANALYZE normally examines every index entry, but if PRAGMA analysis_limit (available since SQLite 3.32) has been set to a nonzero value, statistics are estimated from a sample instead. Disable sampling, clear the existing statistics, and regenerate them:
PRAGMA analysis_limit = 0;  -- 0 = scan every index entry
DELETE FROM sqlite_stat1;   -- delete existing statistics
ANALYZE;                    -- regenerate with a full scan
Check if the updated statistics match the manual calculation from Step 1. If discrepancies persist, consider reporting the issue to the SQLite team, as it may indicate a bug in how JSON functions interact with statistics collection.
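A minimal sketch of the rebuild sequence (illustrative table; on builds older than 3.32 the pragma is simply ignored):

```python
# Clears stale statistics and regenerates them with a full scan.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])
conn.execute("CREATE INDEX t_v ON t(v)")
conn.execute("ANALYZE")

conn.execute("DELETE FROM sqlite_stat1")   # drop existing statistics
conn.execute("PRAGMA analysis_limit = 0")  # 0 = examine every index entry
conn.execute("ANALYZE")                    # regenerate from scratch

stat = conn.execute("SELECT stat FROM sqlite_stat1 WHERE idx = 't_v'").fetchone()[0]
print(stat)  # "100 1": 100 rows, all values distinct
```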
Step 6: Use Covering Indexes for Critical Queries
If the query planner still chooses suboptimal plans, create a covering index that lists both the extracted parentId and frequently accessed columns. Note that SQLite does not support an INCLUDE clause (that syntax belongs to SQL Server and PostgreSQL); instead, append the extra columns to the index key:
CREATE INDEX events_parentId_covering ON events (
    json_extract(json, '$.parentId'),
    id
);
This reduces the need for the planner to access the main table, mitigating the impact of inaccurate statistics. Avoid appending large columns such as the full json document, which would roughly duplicate the table inside the index.
Final Solution: Combining Partial Indexes and Generated Columns
For optimal results, combine the above strategies. Use a generated column to enforce type consistency and a partial index to exclude NULLs:
ALTER TABLE events ADD COLUMN parentId INTEGER
    GENERATED ALWAYS AS (json_extract(json, '$.parentId')) VIRTUAL;
CREATE INDEX events_parentId_optimized ON events(parentId)
    WHERE parentId IS NOT NULL;
ANALYZE;
This setup ensures accurate statistics, efficient index usage, and type safety for queries involving parentId.