Accurately Estimating Virtual Table Costs and Rows for SQLite Query Optimization


Understanding SQLite’s Cost and Row Estimation for Virtual Table Integration

When developing virtual tables in SQLite that proxy real tables, achieving parity in query plan efficiency requires precise emulation of how SQLite estimates rows and costs for real tables. Misalignment in these estimates can lead to suboptimal join orders, index selections, and execution strategies. This guide dissects the mechanisms behind SQLite’s internal estimation logic, identifies common pitfalls in virtual table implementations, and provides actionable solutions to align virtual table behavior with real table performance characteristics.


Core Components of SQLite’s Estimation Logic

SQLite’s query planner relies on two critical metrics provided by tables (real or virtual) during optimization:

  1. Estimated Rows: The approximate number of rows that will be returned by a query operation.
  2. Estimated Cost: A measure of computational effort (e.g., disk I/O operations) required to execute the operation.

For real tables, these values are derived from a combination of schema metadata, index statistics (sqlite_stat1), and B-tree structure analysis (dbstat virtual table). Virtual tables must replicate this logic programmatically in their xBestIndex method to ensure parity. Below, we explore how SQLite computes these values and how virtual tables can mirror them.


Discrepancies Between Virtual Table Estimates and Real Table Behavior

1. Misinterpretation of LogEst Values

SQLite internally represents row counts and costs using LogEst, a compressed logarithmic scale. Directly using raw values from sqlite_stat1 or dbstat without conversion leads to mismatches. For example:

  • A sqlite_stat1 entry with nRow=100 might be stored as LogEst=46 (see LogEst conversion).
  • A virtual table returning estimatedRows=100 instead of LogEst=46 will confuse the planner, as .wheretrace outputs reflect LogEst values.

2. Inaccurate Disk Access Modeling

The estimatedCost field in xBestIndex must approximate the number of disk accesses required for the operation. Real tables compute this by analyzing B-tree depth (for rowid lookups) or leaf pages (for scans), but virtual tables often overlook:

  • B-tree traversal costs: Accessing a rowid in a B-tree of depth 3 requires 3 disk accesses (root + internal + leaf).
  • Index coverage: A covering index avoids heap accesses, reducing cost.

3. Failure to Align with Query Planner Heuristics

SQLite applies heuristics to avoid pathological plans. For instance:

  • "WHERE clause optimism": The planner assumes equality constraints (col=42) reduce row counts more aggressively than range constraints (col>42).
  • Index usability: Unique indexes are preferred over non-unique ones, even if their estimated costs are similar.

Virtual tables that ignore these nuances will produce incompatible cost/row estimates, leading to suboptimal plans.


Aligning Virtual Table Estimates with SQLite’s Internal Logic

Step 1: Convert sqlite_stat1 Values to LogEst

The sqlite_stat1 table stores statistics in LogEst format. Use the following conversion tools:

  • LogEst to Integer:
    int logEstToInt(int logEst) {
      return (int)(pow(2, logEst/10.0) + 0.5);
    }
    
  • Integer to LogEst:
    int intToLogEst(int n) {
      return (n == 0) ? 0 : (int)(10.0 * log2(n));
    }
    

Example:
If sqlite_stat1 reports nRow=46 (LogEst for 100 rows), a virtual table should return estimatedRows=46 instead of 100.

Step 2: Compute Disk Accesses Using dbstat

The dbstat virtual table exposes B-tree geometry. For a real table tbl:

-- Get B-tree depth and leaf pages:
SELECT name, path, pageno, pagetype, ncell 
FROM dbstat WHERE name='tbl';
  • Rowid Lookup Cost:
    estimatedCost = depth_of_btree; // e.g., 3 for depth=3
    
  • Full Table Scan Cost:
    estimatedCost = number_of_leaf_pages;
    
  • Index Scan Cost:
    For a covering index: depth_of_index_btree + number_of_index_leaf_pages.
    For a non-covering index: add number_of_heap_leaf_pages (from main table).

Step 3: Adjust for Query Planner Heuristics

  • Apply Selectivity Factors:
    • Equality constraints (col=42) reduce estimated rows by 90%.
    • Range constraints (col>42) reduce by 33%.
  • Index Uniqueness:
    Multiply cost by 0.1 for unique indexes to reflect faster lookups.

Example:
A non-unique index scan with 100 LogEst rows and 10 leaf pages:

estimatedRows = 100 * 0.33 = 33; // Range constraint
estimatedCost = 10 * 1.0 = 10;   // Non-unique index

Step 4: Validate with .wheretrace and .selecttrace

Enable tracing in the SQLite CLI:

-- Enable detailed tracing:
.selecttrace 0xffff;
.wheretrace 0xfff;

Sample Output Analysis:

WHERE-plan scores: idx1(cost=45.7 rows=20), idx2(cost=50.1 rows=15)
Chosen plan: idx1

Adjust virtual table xBestIndex outputs until the traced costs match real table behavior.


Final Recommendations for Virtual Table Developers

  1. Use LogEst Consistently: Always convert row counts and costs to/from LogEst.
  2. Mirror B-tree Geometry: Derive disk accesses from dbstat data for accuracy.
  3. Emulate Planner Heuristics: Apply selectivity and uniqueness adjustments.
  4. Iterate with Tracing: Use .wheretrace to validate estimates against real tables.

By rigorously applying these steps, virtual tables can achieve query plan parity with real tables, ensuring optimal performance in mixed-table queries.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *