Accurately Estimating Virtual Table Costs and Rows for SQLite Query Optimization
Understanding SQLite’s Cost and Row Estimation for Virtual Table Integration
When developing virtual tables in SQLite that proxy real tables, achieving parity in query plan efficiency requires precise emulation of how SQLite estimates rows and costs for real tables. Misalignment in these estimates can lead to suboptimal join orders, index selections, and execution strategies. This guide dissects the mechanisms behind SQLite’s internal estimation logic, identifies common pitfalls in virtual table implementations, and provides actionable solutions to align virtual table behavior with real table performance characteristics.
Core Components of SQLite’s Estimation Logic
SQLite’s query planner relies on two critical metrics provided by tables (real or virtual) during optimization:
- Estimated Rows: The approximate number of rows that will be returned by a query operation.
- Estimated Cost: A measure of computational effort (e.g., disk I/O operations) required to execute the operation.
For real tables, these values are derived from a combination of schema metadata, index statistics (sqlite_stat1
), and B-tree structure analysis (dbstat
virtual table). Virtual tables must replicate this logic programmatically in their xBestIndex
method to ensure parity. Below, we explore how SQLite computes these values and how virtual tables can mirror them.
Discrepancies Between Virtual Table Estimates and Real Table Behavior
1. Misinterpretation of LogEst Values
SQLite internally represents row counts and costs using LogEst, a compressed logarithmic scale. Directly using raw values from sqlite_stat1
or dbstat
without conversion leads to mismatches. For example:
- A
sqlite_stat1
entry withnRow=100
might be stored asLogEst=46
(see LogEst conversion). - A virtual table returning
estimatedRows=100
instead ofLogEst=46
will confuse the planner, as.wheretrace
outputs reflect LogEst values.
2. Inaccurate Disk Access Modeling
The estimatedCost
field in xBestIndex
must approximate the number of disk accesses required for the operation. Real tables compute this by analyzing B-tree depth (for rowid lookups) or leaf pages (for scans), but virtual tables often overlook:
- B-tree traversal costs: Accessing a rowid in a B-tree of depth 3 requires 3 disk accesses (root + internal + leaf).
- Index coverage: A covering index avoids heap accesses, reducing cost.
3. Failure to Align with Query Planner Heuristics
SQLite applies heuristics to avoid pathological plans. For instance:
- "WHERE clause optimism": The planner assumes equality constraints (
col=42
) reduce row counts more aggressively than range constraints (col>42
). - Index usability: Unique indexes are preferred over non-unique ones, even if their estimated costs are similar.
Virtual tables that ignore these nuances will produce incompatible cost/row estimates, leading to suboptimal plans.
Aligning Virtual Table Estimates with SQLite’s Internal Logic
Step 1: Convert sqlite_stat1
Values to LogEst
The sqlite_stat1
table stores statistics in LogEst format. Use the following conversion tools:
- LogEst to Integer:
int logEstToInt(int logEst) { return (int)(pow(2, logEst/10.0) + 0.5); }
- Integer to LogEst:
int intToLogEst(int n) { return (n == 0) ? 0 : (int)(10.0 * log2(n)); }
Example:
If sqlite_stat1
reports nRow=46
(LogEst for 100 rows), a virtual table should return estimatedRows=46
instead of 100.
Step 2: Compute Disk Accesses Using dbstat
The dbstat
virtual table exposes B-tree geometry. For a real table tbl
:
-- Get B-tree depth and leaf pages:
SELECT name, path, pageno, pagetype, ncell
FROM dbstat WHERE name='tbl';
- Rowid Lookup Cost:
estimatedCost = depth_of_btree; // e.g., 3 for depth=3
- Full Table Scan Cost:
estimatedCost = number_of_leaf_pages;
- Index Scan Cost:
For a covering index:depth_of_index_btree + number_of_index_leaf_pages
.
For a non-covering index: addnumber_of_heap_leaf_pages
(from main table).
Step 3: Adjust for Query Planner Heuristics
- Apply Selectivity Factors:
- Equality constraints (
col=42
) reduce estimated rows by 90%. - Range constraints (
col>42
) reduce by 33%.
- Equality constraints (
- Index Uniqueness:
Multiply cost by0.1
for unique indexes to reflect faster lookups.
Example:
A non-unique index scan with 100 LogEst rows and 10 leaf pages:
estimatedRows = 100 * 0.33 = 33; // Range constraint
estimatedCost = 10 * 1.0 = 10; // Non-unique index
Step 4: Validate with .wheretrace
and .selecttrace
Enable tracing in the SQLite CLI:
-- Enable detailed tracing:
.selecttrace 0xffff;
.wheretrace 0xfff;
Sample Output Analysis:
WHERE-plan scores: idx1(cost=45.7 rows=20), idx2(cost=50.1 rows=15)
Chosen plan: idx1
Adjust virtual table xBestIndex
outputs until the traced costs match real table behavior.
Final Recommendations for Virtual Table Developers
- Use LogEst Consistently: Always convert row counts and costs to/from LogEst.
- Mirror B-tree Geometry: Derive disk accesses from
dbstat
data for accuracy. - Emulate Planner Heuristics: Apply selectivity and uniqueness adjustments.
- Iterate with Tracing: Use
.wheretrace
to validate estimates against real tables.
By rigorously applying these steps, virtual tables can achieve query plan parity with real tables, ensuring optimal performance in mixed-table queries.