Query Planner Selects Suboptimal Index Despite Analyze in Spatial Filter Scenario

Spatial Index vs. Column Filter Selectivity Mismatch in Query Execution Plans

Understanding the Core Conflict Between Spatial Indexes and Column Filters

The issue arises when SQLite’s query planner selects a spatial index (R-tree) over a highly selective column filter (lod = 0) despite clear statistical evidence that the latter would drastically reduce the result set. The database contains a table ascat with a lod column (values 0–8 or 255) and a spatial column the_geom indexed via an R-tree virtual table (rtree_ascat_the_geom). A query combines both a bounding box filter (using the R-tree) and a lod equality condition. While the lod filter reduces the candidate rows to 267 records, the spatial filter selects 126,347 rows. Despite this disparity, the query planner prioritizes the spatial index, leading to inefficient execution.

The root cause lies in SQLite’s cost-based optimizer (CBO) misestimating the selectivity of the spatial filter relative to the lod column filter. The R-tree’s virtual table implementation and the use of a subquery (fid IN (SELECT id FROM rtree_...) complicate selectivity estimation. SQLite’s statistics collection via ANALYZE does not fully account for spatial index characteristics, and the CBO defaults to heuristics that favor the spatial index path even when column filters are demonstrably more selective.

Factors Leading to Suboptimal Index Selection in Spatial Queries

  1. Spatial Index Statistics Are Not Integrated with ANALYZE:
    SQLite’s ANALYZE command populates the sqlite_stat1 table with histogram data for standard B-tree indexes but does not gather statistics for R-tree virtual tables. The R-tree’s internal structure (hierarchical bounding boxes) is opaque to the query planner, which assumes uniform data distribution. This leads to underestimation of spatial filter selectivity when the bounding box covers most of the coordinate space (e.g., global extents).

  2. Subquery Execution Order and Predicate Pushdown Limitations:
    The fid IN (SELECT id FROM rtree_...) subquery forces SQLite to process the spatial filter first. The CBO lacks visibility into the correlation between fid values and the R-tree’s bounding boxes. Without accurate row count estimates for the subquery, the planner defaults to scanning the R-tree index, unaware that the lod filter would eliminate 99.8% of rows.

  3. Integer Primary Key (RowID) Assumptions:
    The fid column is an INTEGER PRIMARY KEY, which aliases the implicit rowid. SQLite assumes direct rowid access is always efficient, but in this case, the IN clause converts the spatial subquery into a list of rowids, bypassing potential optimizations like predicate reordering. The planner prioritizes the spatial index scan due to the perceived efficiency of rowid lookups, even when the lod index would yield fewer rows.

  4. Query Structure and Auto-Generated SQL Constraints:
    Auto-generated queries using rigid patterns (e.g., WHERE lod = X AND fid IN (spatial_subquery)) limit the planner’s ability to reorder predicates. The AND condition forces both filters to be applied, but the lack of a JOIN syntax prevents SQLite from considering alternative join orders or index merges.

Optimizing Query Plans for Mixed Spatial and Columnar Filters

Step 1: Validate Statistics Collection and Index Coverage
Ensure ANALYZE has been executed correctly and that statistics exist for the lod_index. Query sqlite_stat1 to verify:

SELECT tbl, idx, stat FROM sqlite_stat1 WHERE tbl = 'ascat';

If lod_index is missing, re-run ANALYZE. For spatial indexes, manually estimate selectivity by querying the R-tree’s node structure (if accessible) or use empirical testing to derive correction factors.

Step 2: Rewrite the Query to Use Explicit JOIN Syntax
Replace the IN subquery with a JOIN to give the planner flexibility in reordering operations:

SELECT a.fid, a.the_geom, a.wind_speed, a.wind_direc, a.backscatte, a.lod
FROM ascat a
JOIN rtree_ascat_the_geom r ON a.fid = r.id
WHERE a.lod = 0
  AND r.maxx >= -182.5714285714386 AND r.minx <= 182.5714285714386
  AND r.maxy >= -92.57142857143857 AND r.miny <= 92.57142857143857;

This formulation allows SQLite to consider both lod_index and the spatial index as equal candidates, potentially using the lod filter to reduce the dataset before applying the spatial constraints.

Step 3: Force Index Selection with Query Hints
If rewriting queries is impractical, use INDEXED BY to mandate the lod_index:

SELECT ... FROM ascat INDEXED BY lod_index WHERE lod = 0 AND fid IN (...);

Caution: This approach risks plan regression if selectivity ratios change. Reserve it for cases where lod is consistently the most selective filter.

Step 4: Leverage Composite Indexes or Covering Indexes
Create a composite index on (lod, fid) to allow efficient filtering on lod while retrieving fid values for spatial validation:

CREATE INDEX lod_fid_index ON ascat(lod, fid);

The covering index eliminates the need for table lookups after index scans, reducing I/O overhead. However, this increases index storage size and write latency.

Step 5: Use SQLITE_STATS to Manually Adjust Selectivity Estimates
For advanced scenarios, insert custom statistics into sqlite_stat1 to override the CBO’s assumptions. Example:

INSERT INTO sqlite_stat1(tbl, idx, stat)
VALUES ('ascat', 'lod_index', '267 1');

This tells SQLite that lod_index on ascat returns 267 rows on average. Adjust values based on empirical observations of lod selectivity.

Step 6: Partition Data by LOD for Coarse-Grained Filtering
If lod values are static, partition the table into smaller subtables (e.g., ascat_lod0, ascat_lod1) or use CHECK constraints to enable partition pruning. This physically segregates data by lod, making the filter inherently efficient.

Step 7: Implement a Spatial Bounding Box Cache
Precompute the spatial extent of each lod group and store it in a helper table:

CREATE TABLE lod_spatial_extents (
  lod INTEGER PRIMARY KEY,
  minx REAL, maxx REAL,
  miny REAL, maxy REAL
);

Populate it via:

INSERT INTO lod_spatial_extents
SELECT lod, MIN(r.minx), MAX(r.maxx), MIN(r.miny), MAX(r.maxy)
FROM ascat a JOIN rtree_ascat_the_geom r ON a.fid = r.id
GROUP BY lod;

Modify queries to first check the spatial extents for the target lod:

SELECT a.fid, ...
FROM ascat a
JOIN rtree_ascat_the_geom r ON a.fid = r.id
JOIN lod_spatial_extents e ON a.lod = e.lod
WHERE a.lod = 0
  AND r.maxx >= -182.57 AND r.minx <= 182.57
  AND r.maxy >= -92.57 AND r.miny <= 92.57
  AND -182.57 <= e.maxx AND 182.57 >= e.minx  -- Early exit if lod's extent doesn't overlap
  AND -92.57 <= e.maxy AND 92.57 >= e.miny;

This hybrid approach uses coarse spatial metadata to short-circuit unnecessary spatial index scans.

Step 8: Utilize Subquery Materialization
Force materialization of the spatial subquery to isolate its cost:

SELECT ... FROM ascat
WHERE lod = 0
  AND fid IN (
    SELECT id FROM rtree_ascat_the_geom
    WHERE maxx >= -182.57 AND minx <= 182.57
      AND maxy >= -92.57 AND miny <= 92.57
  );

Adding MATERIALIZED (if supported by the SQLite version) can influence the planner to evaluate the subquery’s result size before proceeding.

Step 9: Profile and Compare Execution Plans
Use EXPLAIN QUERY PLAN and SQLite’s bytecode engine (EXPLAIN) to compare the CPU and I/O costs of alternative query forms. Look for discrepancies in estimated row counts at each pipeline stage.

Step 10: Consider Spatial Index Tuning
Adjust the R-tree’s node size (via rtree_node_size) to balance read/write performance. Smaller nodes improve spatial selectivity but increase tree depth. Experiment with values between 32 and 512 bytes.

Final Note on Adaptive Optimization
Implement a feedback loop where query performance metrics dynamically adjust future query formulations. For auto-generated SQL, integrate a middleware layer that rewrites queries based on real-time selectivity estimates derived from sampled data.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *