and Resolving Index Usage with Inequality and NOT IN Conditions in SQLite

Index Selection Challenges with Inequality and NOT IN Operators

The core issue revolves around SQLite’s query planner opting for full table scans (SCAN TABLE) instead of leveraging available indexes when filtering rows using inequality operators (!=, NOT IN, NOT LIKE). This behavior contrasts with queries using equality (=, IN, LIKE '...%'), where the query planner reliably utilizes indexes (SEARCH TABLE USING INDEX).

Key Observations from Query Plans

Consider the following schema and queries:

CREATE TABLE table_a (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  type INTEGER
);
CREATE INDEX index_a ON table_a(type);

Equality vs. Inequality on Primary Key:
- SELECT * FROM table_a WHERE id = 1 → Uses primary key index (SEARCH).
- SELECT * FROM table_a WHERE id != 1 → Full table scan (SCAN).
Equality vs. Inequality on Non-Primary Key:
- SELECT * FROM table_a WHERE type = 1 → Uses index_a (SEARCH).
- SELECT * FROM table_a WHERE type != 1 → Full table scan (SCAN).
IN vs. NOT IN:
- SELECT * FROM table_a WHERE id IN (1) → Uses primary key index.
- SELECT * FROM table_a WHERE id NOT IN (1) → Full table scan.
LIKE vs. NOT LIKE:
- SELECT * FROM table_a WHERE type LIKE 'TEST%' → Index used.
- SELECT * FROM table_a WHERE type NOT LIKE 'TEST%' → Full table scan.

Why This Matters

Indexes are designed to accelerate queries by reducing the number of rows examined. However, the query planner avoids them for inequality conditions due to inherent complexities:

Cost Estimation: The planner estimates the cost of index traversal versus a full scan. For conditions that match a large subset of rows (e.g., type != 1 when 75% of rows meet this), traversing the index and fetching rows may be slower than scanning the entire table.
Index Structure: B-tree indexes excel at range scans (e.g., type < 2) but struggle with non-contiguous exclusion (e.g., type NOT IN (1, 3)), requiring multiple disjoint index traversals.
Covering Index Limitations: Even if an index is used, SQLite must still fetch the full row data from the table unless the index is "covering" (includes all columns in the query). This additional I/O negates the benefit for large result sets.

Factors Influencing Query Planner’s Index Avoidance

1. Data Distribution and Cardinality

Low Cardinality Columns: A column with few distinct values (e.g., type with 4 values) may lead the planner to assume that exclusion conditions (e.g., type != 1) still match a large portion of the table. Without statistics, SQLite defaults to pessimistic assumptions.
High Selectivity vs. Low Selectivity:
- High selectivity: Queries returning few rows (e.g., type = 1) benefit from indexes.
- Low selectivity: Queries returning most rows (e.g., type != 1) are faster via full scans.

2. Lack of Statistical Metadata

SQLite’s query planner relies on stored statistics (generated via ANALYZE) to estimate row counts and index usefulness. Without these statistics:

The planner assumes uniform data distribution.
It cannot accurately assess whether an index scan would outperform a table scan for non-equality conditions.

3. Index Scan Overhead

Row Lookup Penalty: Using a secondary index requires a "rowid lookup" to fetch the full row from the table. For large result sets, this incurs significant overhead.
Non-Contiguous Ranges: NOT IN and != often require scanning multiple disjoint index ranges. For example, type NOT IN (2) translates to type = 1 OR type = 3 OR type = 4, which may involve three separate index searches.

4. Query Structure and Index Design

Single-Column vs. Composite Indexes: A single-column index on type may not be optimal for exclusion queries if combined with other filters.
Covering Indexes: If the index includes all columns referenced in the query (e.g., CREATE INDEX index_a ON table_a(type, id) for SELECT type, id FROM table_a...), the planner may prefer it to avoid row lookups.

Optimizing Index Utilization for Non-Equality Conditions

Step 1: Generate Accurate Statistics with ANALYZE

Run ANALYZE to collect table and index statistics. This enables the query planner to make informed decisions:

ANALYZE table_a;

Stores data distribution metrics in sqlite_stat1 (e.g., histogram of type values).
Critical for low-cardinality columns where exclusion conditions match a moderate subset (e.g., 25% exclusion in a 4-value column).

Step 2: Rewrite Queries for Index-Friendly Patterns

A. Split Non-Contiguous Ranges with UNION ALL

For type != 2, explicitly list valid ranges:

SELECT * FROM table_a WHERE type < 2
UNION ALL
SELECT * FROM table_a WHERE type > 2;

Rewriting as two range queries allows the index to be used for both subqueries.
Use UNION ALL instead of UNION to avoid duplicate removal overhead.

B. Convert NOT IN to Explicit IN Clauses

If excluding a small subset of values:

-- Instead of NOT IN (1)
SELECT * FROM table_a WHERE type IN (2, 3, 4);

Works best when the included values form contiguous ranges (e.g., type > 1).

C. Use Partial Indexes for Frequent Exclusions

Create an index that excludes commonly filtered values:

CREATE INDEX index_a_type_2_3_4 ON table_a(type) WHERE type != 1;

Queries with type != 1 will leverage this index directly.

Step 3: Optimize Index Design

A. Create Covering Indexes

Include all columns referenced in the query to eliminate row lookups:

CREATE INDEX index_a_covering ON table_a(type, id);

For SELECT id, type FROM table_a WHERE type != 1, this index contains all needed data.

B. Combine Indexes for Complex Queries

For multi-column filters, create composite indexes:

CREATE INDEX index_a_type_status ON table_a(type, status);

Enables efficient filtering on type != 1 AND status = 'active'.

Step 4: Use Query Planner Hints

A. Force Index Usage with INDEXED BY

Override the planner’s choice (use sparingly):

SELECT * FROM table_a INDEXED BY index_a WHERE type != 1;

Risks performance degradation if data distribution changes.

B. Guide Selectivity Estimates with LIKELIHOOD

Indicate expected selectivity to the planner:

SELECT * FROM table_a WHERE likelihood(type != 1, 0.75);

Suggests that 75% of rows match type != 1, encouraging index usage.

Step 5: Evaluate Table Scan vs. Index Scan Tradeoffs

For Large Tables: If a full scan is unavoidable, ensure the table is stored sequentially (e.g., using VACUUM to defragment).
For SSD Storage: Table scans are faster than on HDDs, reducing the benefit of index usage for low-selectivity queries.

Step 6: Test with Realistic Data Volumes

Example: For a table with 1M rows and 25% exclusion:
- Compare EXPLAIN QUERY PLAN output before/after optimization.
- Measure execution time with sqlite3_exec() timestamps or client-side profiling.

Final Example: Optimizing NOT LIKE

Rewrite WHERE type NOT LIKE 'TEST%' as:

SELECT * FROM table_a 
WHERE type < 'TEST' OR type >= 'TEST' + char(127);

Exploits lexicographical ordering to define a contiguous exclusion range.

By addressing data statistics, query structure, and index design, developers can coax SQLite into using indexes for non-equality conditions when beneficial. Always validate changes with EXPLAIN QUERY PLAN and real-world performance tests.

and Resolving Index Usage with Inequality and NOT IN Conditions in SQLite

Index Selection Challenges with Inequality and NOT IN Operators

Key Observations from Query Plans

Why This Matters

Factors Influencing Query Planner’s Index Avoidance

1. Data Distribution and Cardinality

2. Lack of Statistical Metadata

3. Index Scan Overhead

4. Query Structure and Index Design

Optimizing Index Utilization for Non-Equality Conditions

Step 1: Generate Accurate Statistics with ANALYZE

Step 2: Rewrite Queries for Index-Friendly Patterns

A. Split Non-Contiguous Ranges with UNION ALL

B. Convert NOT IN to Explicit IN Clauses

C. Use Partial Indexes for Frequent Exclusions

Step 3: Optimize Index Design

A. Create Covering Indexes

B. Combine Indexes for Complex Queries

Step 4: Use Query Planner Hints

A. Force Index Usage with INDEXED BY

B. Guide Selectivity Estimates with LIKELIHOOD

Step 5: Evaluate Table Scan vs. Index Scan Tradeoffs

Step 6: Test with Realistic Data Volumes

Final Example: Optimizing NOT LIKE

SQLite Automatic Index Cost Calculation, Join Order Planning, and Size Estimation

Optimizing SQLite Performance in C++ Applications with Slow Database Operations

Efficient Custom Ordering in SQLite FTS with Thread-Based Sorting

Managing SQLite WAL File Growth in High-Volume Logging Systems

Optimizing SQLite Database Import: Order of VACUUM, ANALYZE, and Index Creation

Performance Degradation in SQLite 3.46.1 Update Statements with Covering Index

Leave a Reply Cancel reply

Index Selection Challenges with Inequality and NOT IN Operators

Key Observations from Query Plans

Why This Matters

Factors Influencing Query Planner’s Index Avoidance

1. Data Distribution and Cardinality

2. Lack of Statistical Metadata

3. Index Scan Overhead

4. Query Structure and Index Design

Optimizing Index Utilization for Non-Equality Conditions

Step 1: Generate Accurate Statistics with ANALYZE

Step 2: Rewrite Queries for Index-Friendly Patterns

A. Split Non-Contiguous Ranges with UNION ALL

B. Convert NOT IN to Explicit IN Clauses

C. Use Partial Indexes for Frequent Exclusions

Step 3: Optimize Index Design

A. Create Covering Indexes

B. Combine Indexes for Complex Queries

Step 4: Use Query Planner Hints

A. Force Index Usage with INDEXED BY

B. Guide Selectivity Estimates with LIKELIHOOD

Step 5: Evaluate Table Scan vs. Index Scan Tradeoffs

Step 6: Test with Realistic Data Volumes

Final Example: Optimizing NOT LIKE

Related Guides

Leave a Reply Cancel reply