Query Planner Selects Suboptimal Index Despite Analyze in Spatial Filter Scenario
Spatial Index vs. Column Filter Selectivity Mismatch in Query Execution Plans
Understanding the Core Conflict Between Spatial Indexes and Column Filters
The issue arises when SQLite’s query planner selects a spatial index (R-tree) over a highly selective column filter (lod = 0
) despite clear statistical evidence that the latter would drastically reduce the result set. The database contains a table ascat
with a lod
column (values 0–8 or 255) and a spatial column the_geom
indexed via an R-tree virtual table (rtree_ascat_the_geom
). A query combines both a bounding box filter (using the R-tree) and a lod
equality condition. While the lod
filter reduces the candidate rows to 267 records, the spatial filter selects 126,347 rows. Despite this disparity, the query planner prioritizes the spatial index, leading to inefficient execution.
The root cause lies in SQLite’s cost-based optimizer (CBO) misestimating the selectivity of the spatial filter relative to the lod
column filter. The R-tree’s virtual table implementation and the use of a subquery (fid IN (SELECT id FROM rtree_...
) complicate selectivity estimation. SQLite’s statistics collection via ANALYZE
does not fully account for spatial index characteristics, and the CBO defaults to heuristics that favor the spatial index path even when column filters are demonstrably more selective.
Factors Leading to Suboptimal Index Selection in Spatial Queries
Spatial Index Statistics Are Not Integrated with ANALYZE:
SQLite’sANALYZE
command populates thesqlite_stat1
table with histogram data for standard B-tree indexes but does not gather statistics for R-tree virtual tables. The R-tree’s internal structure (hierarchical bounding boxes) is opaque to the query planner, which assumes uniform data distribution. This leads to underestimation of spatial filter selectivity when the bounding box covers most of the coordinate space (e.g., global extents).Subquery Execution Order and Predicate Pushdown Limitations:
Thefid IN (SELECT id FROM rtree_...)
subquery forces SQLite to process the spatial filter first. The CBO lacks visibility into the correlation betweenfid
values and the R-tree’s bounding boxes. Without accurate row count estimates for the subquery, the planner defaults to scanning the R-tree index, unaware that thelod
filter would eliminate 99.8% of rows.Integer Primary Key (RowID) Assumptions:
Thefid
column is anINTEGER PRIMARY KEY
, which aliases the implicitrowid
. SQLite assumes direct rowid access is always efficient, but in this case, theIN
clause converts the spatial subquery into a list of rowids, bypassing potential optimizations like predicate reordering. The planner prioritizes the spatial index scan due to the perceived efficiency of rowid lookups, even when thelod
index would yield fewer rows.Query Structure and Auto-Generated SQL Constraints:
Auto-generated queries using rigid patterns (e.g.,WHERE lod = X AND fid IN (spatial_subquery)
) limit the planner’s ability to reorder predicates. TheAND
condition forces both filters to be applied, but the lack of aJOIN
syntax prevents SQLite from considering alternative join orders or index merges.
Optimizing Query Plans for Mixed Spatial and Columnar Filters
Step 1: Validate Statistics Collection and Index Coverage
Ensure ANALYZE
has been executed correctly and that statistics exist for the lod_index
. Query sqlite_stat1
to verify:
SELECT tbl, idx, stat FROM sqlite_stat1 WHERE tbl = 'ascat';
If lod_index
is missing, re-run ANALYZE
. For spatial indexes, manually estimate selectivity by querying the R-tree’s node structure (if accessible) or use empirical testing to derive correction factors.
Step 2: Rewrite the Query to Use Explicit JOIN Syntax
Replace the IN
subquery with a JOIN
to give the planner flexibility in reordering operations:
SELECT a.fid, a.the_geom, a.wind_speed, a.wind_direc, a.backscatte, a.lod
FROM ascat a
JOIN rtree_ascat_the_geom r ON a.fid = r.id
WHERE a.lod = 0
AND r.maxx >= -182.5714285714386 AND r.minx <= 182.5714285714386
AND r.maxy >= -92.57142857143857 AND r.miny <= 92.57142857143857;
This formulation allows SQLite to consider both lod_index
and the spatial index as equal candidates, potentially using the lod
filter to reduce the dataset before applying the spatial constraints.
Step 3: Force Index Selection with Query Hints
If rewriting queries is impractical, use INDEXED BY
to mandate the lod_index
:
SELECT ... FROM ascat INDEXED BY lod_index WHERE lod = 0 AND fid IN (...);
Caution: This approach risks plan regression if selectivity ratios change. Reserve it for cases where lod
is consistently the most selective filter.
Step 4: Leverage Composite Indexes or Covering Indexes
Create a composite index on (lod, fid)
to allow efficient filtering on lod
while retrieving fid
values for spatial validation:
CREATE INDEX lod_fid_index ON ascat(lod, fid);
The covering index eliminates the need for table lookups after index scans, reducing I/O overhead. However, this increases index storage size and write latency.
Step 5: Use SQLITE_STATS to Manually Adjust Selectivity Estimates
For advanced scenarios, insert custom statistics into sqlite_stat1
to override the CBO’s assumptions. Example:
INSERT INTO sqlite_stat1(tbl, idx, stat)
VALUES ('ascat', 'lod_index', '267 1');
This tells SQLite that lod_index
on ascat
returns 267 rows on average. Adjust values based on empirical observations of lod
selectivity.
Step 6: Partition Data by LOD for Coarse-Grained Filtering
If lod
values are static, partition the table into smaller subtables (e.g., ascat_lod0
, ascat_lod1
) or use CHECK
constraints to enable partition pruning. This physically segregates data by lod
, making the filter inherently efficient.
Step 7: Implement a Spatial Bounding Box Cache
Precompute the spatial extent of each lod
group and store it in a helper table:
CREATE TABLE lod_spatial_extents (
lod INTEGER PRIMARY KEY,
minx REAL, maxx REAL,
miny REAL, maxy REAL
);
Populate it via:
INSERT INTO lod_spatial_extents
SELECT lod, MIN(r.minx), MAX(r.maxx), MIN(r.miny), MAX(r.maxy)
FROM ascat a JOIN rtree_ascat_the_geom r ON a.fid = r.id
GROUP BY lod;
Modify queries to first check the spatial extents for the target lod
:
SELECT a.fid, ...
FROM ascat a
JOIN rtree_ascat_the_geom r ON a.fid = r.id
JOIN lod_spatial_extents e ON a.lod = e.lod
WHERE a.lod = 0
AND r.maxx >= -182.57 AND r.minx <= 182.57
AND r.maxy >= -92.57 AND r.miny <= 92.57
AND -182.57 <= e.maxx AND 182.57 >= e.minx -- Early exit if lod's extent doesn't overlap
AND -92.57 <= e.maxy AND 92.57 >= e.miny;
This hybrid approach uses coarse spatial metadata to short-circuit unnecessary spatial index scans.
Step 8: Utilize Subquery Materialization
Force materialization of the spatial subquery to isolate its cost:
SELECT ... FROM ascat
WHERE lod = 0
AND fid IN (
SELECT id FROM rtree_ascat_the_geom
WHERE maxx >= -182.57 AND minx <= 182.57
AND maxy >= -92.57 AND miny <= 92.57
);
Adding MATERIALIZED
(if supported by the SQLite version) can influence the planner to evaluate the subquery’s result size before proceeding.
Step 9: Profile and Compare Execution Plans
Use EXPLAIN QUERY PLAN
and SQLite’s bytecode engine (EXPLAIN
) to compare the CPU and I/O costs of alternative query forms. Look for discrepancies in estimated row counts at each pipeline stage.
Step 10: Consider Spatial Index Tuning
Adjust the R-tree’s node size (via rtree_node_size
) to balance read/write performance. Smaller nodes improve spatial selectivity but increase tree depth. Experiment with values between 32 and 512 bytes.
Final Note on Adaptive Optimization
Implement a feedback loop where query performance metrics dynamically adjust future query formulations. For auto-generated SQL, integrate a middleware layer that rewrites queries based on real-time selectivity estimates derived from sampled data.