and Resolving Index Usage with Inequality and NOT IN Conditions in SQLite


Index Selection Challenges with Inequality and NOT IN Operators

The core issue revolves around SQLite’s query planner opting for full table scans (SCAN TABLE) instead of leveraging available indexes when filtering rows using inequality operators (!=, NOT IN, NOT LIKE). This behavior contrasts with queries using equality (=, IN, LIKE '...%'), where the query planner reliably utilizes indexes (SEARCH TABLE USING INDEX).

Key Observations from Query Plans

Consider the following schema and queries:

CREATE TABLE table_a (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  type INTEGER
);
CREATE INDEX index_a ON table_a(type);
  1. Equality vs. Inequality on Primary Key:

    • SELECT * FROM table_a WHERE id = 1 → Uses primary key index (SEARCH).
    • SELECT * FROM table_a WHERE id != 1 → Full table scan (SCAN).
  2. Equality vs. Inequality on Non-Primary Key:

    • SELECT * FROM table_a WHERE type = 1 → Uses index_a (SEARCH).
    • SELECT * FROM table_a WHERE type != 1 → Full table scan (SCAN).
  3. IN vs. NOT IN:

    • SELECT * FROM table_a WHERE id IN (1) → Uses primary key index.
    • SELECT * FROM table_a WHERE id NOT IN (1) → Full table scan.
  4. LIKE vs. NOT LIKE:

    • SELECT * FROM table_a WHERE type LIKE 'TEST%' → Index used.
    • SELECT * FROM table_a WHERE type NOT LIKE 'TEST%' → Full table scan.

Why This Matters

Indexes are designed to accelerate queries by reducing the number of rows examined. However, the query planner avoids them for inequality conditions due to inherent complexities:

  • Cost Estimation: The planner estimates the cost of index traversal versus a full scan. For conditions that match a large subset of rows (e.g., type != 1 when 75% of rows meet this), traversing the index and fetching rows may be slower than scanning the entire table.
  • Index Structure: B-tree indexes excel at range scans (e.g., type < 2) but struggle with non-contiguous exclusion (e.g., type NOT IN (1, 3)), requiring multiple disjoint index traversals.
  • Covering Index Limitations: Even if an index is used, SQLite must still fetch the full row data from the table unless the index is "covering" (includes all columns in the query). This additional I/O negates the benefit for large result sets.

Factors Influencing Query Planner’s Index Avoidance

1. Data Distribution and Cardinality

  • Low Cardinality Columns: A column with few distinct values (e.g., type with 4 values) may lead the planner to assume that exclusion conditions (e.g., type != 1) still match a large portion of the table. Without statistics, SQLite defaults to pessimistic assumptions.
  • High Selectivity vs. Low Selectivity:
    • High selectivity: Queries returning few rows (e.g., type = 1) benefit from indexes.
    • Low selectivity: Queries returning most rows (e.g., type != 1) are faster via full scans.

2. Lack of Statistical Metadata

SQLite’s query planner relies on stored statistics (generated via ANALYZE) to estimate row counts and index usefulness. Without these statistics:

  • The planner assumes uniform data distribution.
  • It cannot accurately assess whether an index scan would outperform a table scan for non-equality conditions.

3. Index Scan Overhead

  • Row Lookup Penalty: Using a secondary index requires a "rowid lookup" to fetch the full row from the table. For large result sets, this incurs significant overhead.
  • Non-Contiguous Ranges: NOT IN and != often require scanning multiple disjoint index ranges. For example, type NOT IN (2) translates to type = 1 OR type = 3 OR type = 4, which may involve three separate index searches.

4. Query Structure and Index Design

  • Single-Column vs. Composite Indexes: A single-column index on type may not be optimal for exclusion queries if combined with other filters.
  • Covering Indexes: If the index includes all columns referenced in the query (e.g., CREATE INDEX index_a ON table_a(type, id) for SELECT type, id FROM table_a...), the planner may prefer it to avoid row lookups.

Optimizing Index Utilization for Non-Equality Conditions

Step 1: Generate Accurate Statistics with ANALYZE

Run ANALYZE to collect table and index statistics. This enables the query planner to make informed decisions:

ANALYZE table_a;
  • Stores data distribution metrics in sqlite_stat1 (e.g., histogram of type values).
  • Critical for low-cardinality columns where exclusion conditions match a moderate subset (e.g., 25% exclusion in a 4-value column).

Step 2: Rewrite Queries for Index-Friendly Patterns

A. Split Non-Contiguous Ranges with UNION ALL

For type != 2, explicitly list valid ranges:

SELECT * FROM table_a WHERE type < 2
UNION ALL
SELECT * FROM table_a WHERE type > 2;
  • Rewriting as two range queries allows the index to be used for both subqueries.
  • Use UNION ALL instead of UNION to avoid duplicate removal overhead.

B. Convert NOT IN to Explicit IN Clauses

If excluding a small subset of values:

-- Instead of NOT IN (1)
SELECT * FROM table_a WHERE type IN (2, 3, 4);
  • Works best when the included values form contiguous ranges (e.g., type > 1).

C. Use Partial Indexes for Frequent Exclusions

Create an index that excludes commonly filtered values:

CREATE INDEX index_a_type_2_3_4 ON table_a(type) WHERE type != 1;
  • Queries with type != 1 will leverage this index directly.

Step 3: Optimize Index Design

A. Create Covering Indexes

Include all columns referenced in the query to eliminate row lookups:

CREATE INDEX index_a_covering ON table_a(type, id);
  • For SELECT id, type FROM table_a WHERE type != 1, this index contains all needed data.

B. Combine Indexes for Complex Queries

For multi-column filters, create composite indexes:

CREATE INDEX index_a_type_status ON table_a(type, status);
  • Enables efficient filtering on type != 1 AND status = 'active'.

Step 4: Use Query Planner Hints

A. Force Index Usage with INDEXED BY

Override the planner’s choice (use sparingly):

SELECT * FROM table_a INDEXED BY index_a WHERE type != 1;
  • Risks performance degradation if data distribution changes.

B. Guide Selectivity Estimates with LIKELIHOOD

Indicate expected selectivity to the planner:

SELECT * FROM table_a WHERE likelihood(type != 1, 0.75);
  • Suggests that 75% of rows match type != 1, encouraging index usage.

Step 5: Evaluate Table Scan vs. Index Scan Tradeoffs

  • For Large Tables: If a full scan is unavoidable, ensure the table is stored sequentially (e.g., using VACUUM to defragment).
  • For SSD Storage: Table scans are faster than on HDDs, reducing the benefit of index usage for low-selectivity queries.

Step 6: Test with Realistic Data Volumes

  • Example: For a table with 1M rows and 25% exclusion:
    • Compare EXPLAIN QUERY PLAN output before/after optimization.
    • Measure execution time with sqlite3_exec() timestamps or client-side profiling.

Final Example: Optimizing NOT LIKE

Rewrite WHERE type NOT LIKE 'TEST%' as:

SELECT * FROM table_a 
WHERE type < 'TEST' OR type >= 'TEST' + char(127);
  • Exploits lexicographical ordering to define a contiguous exclusion range.

By addressing data statistics, query structure, and index design, developers can coax SQLite into using indexes for non-equality conditions when beneficial. Always validate changes with EXPLAIN QUERY PLAN and real-world performance tests.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *