ANALYZE Misreports Statistics for Indexes Using json_extract in SQLite

Issue Overview: Misleading Index Statistics from ANALYZE with JSON-Based Indexes

When working with SQLite’s ANALYZE command to gather statistics for query optimization, users may encounter unexpected results when indexes involve JSON extraction functions like json_extract(). A common symptom is observing disproportionately low values in the sqlite_stat1 table’s stat column for indexes built on JSON fields. This discrepancy can lead to suboptimal query plans, as the query planner relies on these statistics to estimate the cost of different access paths.

Consider a scenario where a table events contains a JSON column storing hierarchical relationships via a parentId field. An index is created on json_extract(json, '$.parentId'), and ANALYZE is executed to populate sqlite_stat1. The stat column for this index might report an average row count per distinct value (the second integer in the stat string) that seems inconsistent with the actual distribution observed in the data. For example, a value of 11 might be reported even though a direct query shows that the most frequent parentId (including NULL) appears 38,633 times. This mismatch stems from nuances in how ANALYZE calculates statistics for indexes involving JSON functions and how NULL values are handled in such indexes.
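
For reference, a minimal reproduction of this scenario might look like the following. The table and column names follow the example above; the index name and the id column are assumptions for illustration.

-- Hypothetical schema and index matching the scenario described above
CREATE TABLE events (
  id   INTEGER PRIMARY KEY,
  json TEXT NOT NULL        -- JSON document, e.g. '{"parentId": 11384, ...}'
);

CREATE INDEX events_parentId ON events (json_extract(json, '$.parentId'));

-- Gather statistics; the results land in sqlite_stat1
ANALYZE;
SELECT tbl, idx, stat FROM sqlite_stat1 WHERE idx = 'events_parentId';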

Possible Causes: JSON Extraction, NULL Handling, and Statistics Calculation

1. Misinterpretation of sqlite_stat1 Statistics

The stat column in sqlite_stat1 contains two critical integers for single-column indexes:

  • The first integer is the number of entries in the index, which for an ordinary (non-partial) index equals the total number of rows in the table.
  • The second integer is the average number of index entries per distinct value of the indexed column, calculated roughly as index_entries / number_of_distinct_values.

A common misunderstanding is assuming the second integer reflects the maximum number of rows for a single value (e.g., the NULL group with 38,633 rows). Instead, it is the average across all distinct values, including NULL. If the indexed column has many distinct values, the average will be low, even if one value (like NULL) dominates.
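
As a concrete illustration, the sqlite_stat1 entry for the index in the scenario above (index name assumed from the example) would read along these lines:

SELECT stat FROM sqlite_stat1 WHERE idx = 'events_parentId';
-- e.g. '62689 11'
--       62689 = total index entries (roughly the table's row count)
--       11    = average rows per distinct value of the indexed expression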

2. Impact of NULL Values on Index Statistics

When an index includes NULL values (as it does by default in SQLite), ANALYZE treats NULL as a distinct value. For example, if 38,633 rows have NULL for json_extract(json, '$.parentId'), and the remaining 24,056 rows are spread across 2,186 distinct non-NULL values, the total distinct values are 2,186 + 1 = 2,187. The average rows per value becomes 62,689 / 2,187 ≈ 28.67, not 11. That the reported value of 11 instead matches the non-NULL-only average (24,056 / 2,186 ≈ 11) suggests a deeper issue: the NULL entries may not be feeding into the statistics for this expression index in the expected way.
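
The arithmetic above can be reproduced directly against the data; the query below follows the table and column names from the example:

SELECT
  SUM(json_extract(json, '$.parentId') IS NULL)     AS null_rows,
  SUM(json_extract(json, '$.parentId') IS NOT NULL) AS non_null_rows,
  COUNT(DISTINCT json_extract(json, '$.parentId'))  AS distinct_non_null_values
FROM events;
-- For the example: 38633 | 24056 | 2186
-- Average including the NULL group: (38633 + 24056) / (2186 + 1) ≈ 28.67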

3. Data Type Mismatch Between Index and Queries

The type of the value returned by json_extract() follows the type stored in the JSON document: a JSON number comes back as an SQL INTEGER or REAL, while a JSON string comes back as TEXT. If some documents store the field as a string, for instance {"parentId": "11384"}, the extracted value is the text '11384', not the integer 11384. A predicate such as WHERE json_extract(json, '$.parentId') = 11384 then compares TEXT against INTEGER, which never matches in SQLite, so the index on the extracted expression is not used effectively and the statistics do not reflect the intended grouping.
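
The difference is easy to demonstrate with typeof() on literal JSON values (no table required):

SELECT typeof(json_extract('{"parentId": 11384}',   '$.parentId'));   -- integer
SELECT typeof(json_extract('{"parentId": "11384"}', '$.parentId'));   -- text
SELECT json_extract('{"parentId": "11384"}', '$.parentId') = 11384;   -- 0 (TEXT never equals INTEGER)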

Troubleshooting Steps, Solutions & Fixes: Accurate Statistics for JSON-Based Indexes

Step 1: Validate the Interpretation of sqlite_stat1 Data

Before making schema changes, confirm that the reported statistics align with the actual data distribution. Execute the following query to calculate the average rows per distinct value manually:

SELECT
  COUNT(*) AS total_rows,
  -- COUNT(DISTINCT ...) ignores NULLs; MAX(... IS NULL) adds 1 when a NULL group exists
  COUNT(DISTINCT json_extract(json, '$.parentId'))
    + MAX(json_extract(json, '$.parentId') IS NULL) AS distinct_values,
  COUNT(*) * 1.0
    / (COUNT(DISTINCT json_extract(json, '$.parentId'))
       + MAX(json_extract(json, '$.parentId') IS NULL)) AS calculated_average
FROM events;

Compare the calculated_average with the second integer in sqlite_stat1.stat. If they match, the issue is not with ANALYZE but with how the statistics are being interpreted. If they do not match, proceed to the next steps.

Step 2: Address NULL Values with Partial Indexes

To exclude NULL values from the index (and its statistics), create a partial index that filters out rows where json_extract(json, '$.parentId') IS NULL:

CREATE INDEX events_parentId_not_null ON events (
  json_extract(json, '$.parentId')
) WHERE json_extract(json, '$.parentId') IS NOT NULL;

Re-run ANALYZE and check sqlite_stat1 for the new index. The average rows per value should now reflect only non-NULL entries. For example, if there are 24,056 non-NULL rows and 2,186 distinct values, the average becomes 24,056 / 2,186 ≈ 11, matching the originally reported statistic. This makes the index more selective for non-NULL queries.
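
A query that filters on a specific parentId implies the IS NOT NULL condition, so the planner is free to use the partial index. A quick way to confirm this (the literal value is only illustrative) is to inspect the query plan:

EXPLAIN QUERY PLAN
SELECT id FROM events
WHERE json_extract(json, '$.parentId') = 11384;
-- Expected, roughly: SEARCH events USING INDEX events_parentId_not_null (<expr>=?)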

Step 3: Use Generated Columns for Type Consistency

To resolve data type mismatches between JSON-extracted values and integer-based queries, define a generated column that explicitly casts the JSON value to an integer. Note that ALTER TABLE can only add a VIRTUAL generated column; a STORED column must be declared in the original CREATE TABLE (which would require rebuilding the table):

ALTER TABLE events ADD COLUMN parentId INTEGER
  GENERATED ALWAYS AS (CAST(json_extract(json, '$.parentId') AS INTEGER)) VIRTUAL;
CREATE INDEX events_parentId_int ON events(parentId);

This column exposes parentId as an INTEGER, ensuring that the index and queries use the same data type. ANALYZE will now collect statistics based on integer values, which are more likely to align with query predicates. A VIRTUAL column adds nothing to the table's storage; the computed values are materialized only inside the index. If you do rebuild the table, a STORED column trades extra storage for not recomputing the expression on reads that bypass the index.
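
With the generated column in place, queries can reference parentId directly, and the planner should pick up the new index; for example:

EXPLAIN QUERY PLAN
SELECT id FROM events WHERE parentId = 11384;
-- Should report a SEARCH on events using events_parentId_int (parentId=?)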

Step 4: Verify JSON Structure and Extraction Logic

Ensure that all JSON entries for parentId use a consistent data type (e.g., always a number, never a string). Inconsistent typing (e.g., "parentId": "11384" vs. "parentId": 11384) leads to multiple distinct values in the index ('11384' vs. 11384), inflating the number of distinct values and lowering the average rows per value. Use a JSON schema validator or trigger to enforce consistency.
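
A quick way to check whether the stored documents mix types for parentId is to group by json_type():

SELECT json_type(json, '$.parentId') AS parentId_type, COUNT(*) AS row_count
FROM events
GROUP BY parentId_type;
-- Ideally this returns only 'integer' (plus NULL for rows without a parentId)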

Step 5: Rebuild Statistics with Full Scan

ANALYZE normally scans each index in full; it only estimates from a partial scan when PRAGMA analysis_limit is set to a non-zero value (PRAGMA optimize, for example, runs its ANALYZE with such a limit). To make sure the statistics come from a full scan, reset the limit, discard the old statistics, and regenerate them:

PRAGMA analysis_limit = 0;   -- 0 = no limit: scan indexes in full
DELETE FROM sqlite_stat1;    -- Discard existing statistics
ANALYZE;                     -- Regenerate from a full scan

Check if the updated statistics match the manual calculation from Step 1. If discrepancies persist, consider reporting the issue to the SQLite team, as it may indicate a bug in how JSON functions interact with statistics collection.

Step 6: Use Covering Indexes for Critical Queries

If the query planner still chooses suboptimal plans, create a covering index. SQLite has no INCLUDE clause; instead, append the frequently accessed columns after the extracted parentId as additional key columns:

CREATE INDEX events_parentId_covering ON events (
  json_extract(json, '$.parentId'),
  id,
  json
);

This reduces the need for the planner to access the main table, mitigating the impact of inaccurate statistics. Keep in mind that appending the json column copies every document into the index, so it is only worthwhile when the saved table lookups justify the extra space; id is already covered implicitly if it is the table's INTEGER PRIMARY KEY (rowid alias).
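
If the appended columns satisfy the entire select list, the plan should report a covering index, which confirms the table itself is not being touched:

EXPLAIN QUERY PLAN
SELECT id, json FROM events
WHERE json_extract(json, '$.parentId') = 11384;
-- Expected, roughly: SEARCH events USING COVERING INDEX events_parentId_covering (<expr>=?)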

Final Solution: Combining Partial Indexes and Generated Columns

For optimal results, combine the above strategies. Use a generated column to enforce type consistency and a partial index to exclude NULLs (as in Step 3, ALTER TABLE limits the new column to VIRTUAL):

ALTER TABLE events ADD COLUMN parentId INTEGER
  GENERATED ALWAYS AS (CAST(json_extract(json, '$.parentId') AS INTEGER)) VIRTUAL;
CREATE INDEX events_parentId_optimized ON events(parentId) 
  WHERE parentId IS NOT NULL;
ANALYZE;

This setup ensures accurate statistics, efficient index usage, and type safety for queries involving parentId.
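
A quick verification of the combined setup (names from the example above; the exact stat figures depend on the data):

SELECT idx, stat FROM sqlite_stat1 WHERE idx = 'events_parentId_optimized';
-- Roughly '24056 11' for the example distribution: non-NULL entries only

EXPLAIN QUERY PLAN
SELECT id FROM events WHERE parentId = 11384;
-- Should report a SEARCH using events_parentId_optimized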
