Performance Degradation with DISTINCT Due to Indexing and Materialization Issues

Understanding the Shift from COVERING INDEX to INDEX with DISTINCT

The core issue revolves around a significant performance degradation when introducing the DISTINCT keyword in a query involving Common Table Expressions (CTEs) and joins. The original query executes in 142 seconds without DISTINCT but escalates to ~10 hours when DISTINCT is applied. The query plan reveals a critical shift from using a COVERING INDEX to a standard INDEX scan, accompanied by the creation of a temporary B-tree structure to enforce uniqueness. This behavior is observed across multiple scenarios, including CTE materialization attempts and joins.

Key observations include:

  1. Index Utilization Differences: The absence of DISTINCT allows SQLite to leverage a covering index (object_hashes_hash), which includes all columns required for the query. When DISTINCT is introduced, the query planner reverts to a non-covering index (object_hashes_objid), forcing additional table lookups.
  2. Temporary B-Tree Overhead: The USE TEMP B-TREE FOR DISTINCT step introduces significant I/O and computational costs, especially for large datasets (e.g., 4.4 million rows); the sketch after this list shows how to surface this step with EXPLAIN QUERY PLAN.
  3. Materialization Pitfalls: Attempts to materialize intermediate results (e.g., using MATERIALIZED CTEs) inadvertently disrupt index usage, leading to full scans and slower execution.
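
Both the index shift and the temporary B-tree step are visible directly in EXPLAIN QUERY PLAN output. A minimal comparison, assuming a simplified query over object_hashes (the actual query wraps this in the CTEs discussed below):

-- Without DISTINCT: expect a scan of a covering index
EXPLAIN QUERY PLAN
SELECT item_id, cntr_id, hash_id FROM object_hashes;

-- With DISTINCT: look for "USE TEMP B-TREE FOR DISTINCT" and a switch away
-- from the covering index
EXPLAIN QUERY PLAN
SELECT DISTINCT item_id, cntr_id, hash_id FROM object_hashes;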

Root Causes of Performance Degradation

1. Incomplete Index Coverage for DISTINCT Requirements

The DISTINCT operation requires uniqueness across the combination of item_id, cntr_id, and hash_id. The existing index object_hashes_objid is defined as (item_id, cntr_id), which does not include hash_id. When DISTINCT is applied, SQLite must fetch hash_id from the underlying table, bypassing the covering index. This forces:

  • Rowid Lookups: Each entry in the index requires a secondary lookup into the object_hashes table to retrieve hash_id.
  • Increased I/O: For large datasets, this results in random disk accesses, exacerbating latency.
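
For reference, the non-covering index described above corresponds to a definition along these lines (reconstructed from the discussion; the actual DDL may differ):

-- object_hashes_objid holds only (item_id, cntr_id); a query that also needs
-- hash_id must follow each index entry back into the table by rowid
CREATE INDEX object_hashes_objid ON object_hashes (item_id, cntr_id);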

2. Materialization Disrupts Index Propagation

Materializing CTEs (e.g., hshs or dupe_hashes) creates temporary tables that lack the original indexes. For example:

  • The hshs CTE materializes hash_id values into an unindexed temporary table, so it gains nothing from the object_hashes_hash index, and subsequent joins against object_hashes fall back to non-covering index scans.
  • Materialized CTEs force the optimizer to treat them as static datasets, preventing it from pushing predicates inside or exploiting indexes during query execution (the NOT MATERIALIZED hint shown below makes the inlining explicit).
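
SQLite 3.35 and later let the query spell this choice out. A minimal sketch, assuming hshs simply selects the ids of sha256 hashes, as in the later examples:

-- NOT MATERIALIZED asks the planner to inline the CTE so the join against
-- object_hashes can still use its indexes; MATERIALIZED forces a temp table
WITH hshs AS NOT MATERIALIZED (
  SELECT id AS hash_id
  FROM hashes
  WHERE type = (SELECT id FROM hash_types WHERE name = 'sha256')
)
SELECT oh.item_id, oh.cntr_id, oh.hash_id
FROM hshs
JOIN object_hashes oh ON oh.hash_id = hshs.hash_id;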

3. Suboptimal Query Planner Decisions with LIKELIHOOD Hints

The use of LIKELIHOOD modifiers (e.g., LIKELIHOOD(COUNT(*) > 1, 0.17021)) may mislead the query planner’s cost estimates. These hints suggest low probabilities for certain conditions, but the actual data distribution (e.g., many duplicate hash_id values) contradicts these assumptions. This results in:

  • Inefficient Join Orders: The planner prioritizes filtering steps that are computationally expensive because its selectivity estimates are skewed.
  • Premature Materialization of Intermediate Results: The planner may materialize subqueries too early, losing opportunities for index optimizations (see the comparison below).
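
Since LIKELIHOOD only influences planning (it returns its first argument unchanged), the cheapest experiment is to drop the hint and compare plans. A sketch of the hinted and unhinted forms of the duplicate-detection step:

-- Hinted: claims COUNT(*) > 1 holds only ~17% of the time, which skews join
-- ordering when duplicate hash_id values are in fact common
SELECT hash_id FROM object_hashes
GROUP BY hash_id
HAVING LIKELIHOOD(COUNT(*) > 1, 0.17021);

-- Unhinted: let ANALYZE-derived statistics drive the estimate
SELECT hash_id FROM object_hashes
GROUP BY hash_id
HAVING COUNT(*) > 1;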

4. Temporary B-Tree Construction Overhead

The USE TEMP B-TREE FOR DISTINCT step involves sorting and deduplicating rows in a temporary structure. For 4.4 million rows:

  • Memory Pressure: If the temporary B-tree exceeds available memory, SQLite spills to disk, causing significant slowdowns.
  • Redundant Sorting: The temporary B-tree may sort data that is already partially ordered by existing indexes, wasting resources.

Optimizing Query Performance: Strategies and Fixes

1. Revise Indexing Strategy for DISTINCT Columns

Create a Composite Covering Index:

CREATE INDEX object_hashes_covering ON object_hashes (hash_id, item_id, cntr_id);

This index includes all columns required by the DISTINCT clause, allowing SQLite to:

  • Perform index-only scans for uniqueness checks.
  • Avoid table lookups when fetching item_id and cntr_id.

Impact Analysis:

  • The dupe_hashes CTE will use the covering index, eliminating rowid lookups.
  • The DISTINCT operation in uniq_hashes can leverage the same index, reducing the need for a temporary B-tree.

Verification:
After creating the index, check the query plan:

QUERY PLAN
|--CO-ROUTINE raw_dup_hashes
| |--SCAN oh USING COVERING INDEX object_hashes_covering
...
`--SEARCH oh USING COVERING INDEX object_hashes_covering (hash_id=?)

The absence of USE TEMP B-TREE FOR DISTINCT indicates successful optimization.


2. Avoid Premature Materialization of CTEs

Instead of materializing CTEs like hshs or dupe_hashes, force index retention by:

  • Using subqueries with INDEXED BY hints (if necessary).
  • Rewriting the query to keep intermediate results inlined.

Modified Query Structure:

WITH
raw_dup_hashes AS (
  SELECT oh.hash_id
  FROM object_hashes oh INDEXED BY object_hashes_covering
  WHERE oh.hash_id != (SELECT id FROM hashes WHERE hash = x'e3b0c4...' LIMIT 1)
  GROUP BY oh.hash_id
  HAVING COUNT(*) > 1
),
dupe_hashes AS (
  SELECT oh.item_id, oh.cntr_id, r.hash_id
  FROM raw_dup_hashes r
  JOIN hashes h ON h.id = r.hash_id
  JOIN object_hashes oh INDEXED BY object_hashes_covering ON oh.hash_id = r.hash_id
  WHERE h.type = (SELECT id FROM hash_types WHERE name = 'sha256')
)
SELECT COUNT(DISTINCT item_id || ',' || cntr_id || ',' || hash_id)
FROM dupe_hashes;

Key Adjustments:

  • Removed LIKELIHOOD hints to allow accurate cost estimation.
  • Used INDEXED BY to enforce covering index usage.
  • Collapsed the three-column uniqueness check into COUNT(DISTINCT ...) over a concatenated key, since SQLite's COUNT(DISTINCT) accepts only one argument; deduplication still happens, but over a single expression rather than full rows (the snippet below illustrates the limitation).
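
For context, SQLite rejects DISTINCT aggregates with more than one argument, which is why the key is concatenated; with integer columns and a separator the composite key remains unambiguous:

-- SELECT COUNT(DISTINCT item_id, cntr_id) FROM object_hashes;  -- syntax error
SELECT COUNT(DISTINCT item_id || ',' || cntr_id || ',' || hash_id)
FROM object_hashes;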

3. Leverage Partial Indexes for Filtered Data

If excluding the ubiquitous hash x'e3b0c4...' is a common filter, a partial index can shrink the search space. Note that SQLite does not allow subqueries or references to other tables in a partial-index WHERE clause, so the sha256-type condition cannot be baked into the index and the excluded hash must be written as a literal id:

-- 42 is a placeholder: substitute the actual id of x'e3b0c4...' from the hashes table
CREATE INDEX object_hashes_filtered ON object_hashes (hash_id, item_id, cntr_id)
WHERE hash_id != 42;

Benefits:

  • Reduces the index size by excluding rows that reference the ubiquitous, explicitly excluded hash.
  • Directly supports the query’s exclusion filter (the sha256-type condition still requires the join against hashes), accelerating the raw_dup_hashes step.

4. Optimize Temporary Storage for DISTINCT Operations

If a temporary B-tree is unavoidable, configure SQLite to use in-memory storage:

PRAGMA temp_store = MEMORY;
PRAGMA cache_size = -200000;  -- negative value is in KiB: roughly 200 MB of page cache

Considerations:

  • Ensure sufficient memory is available to avoid disk spillage.
  • Re-run EXPLAIN QUERY PLAN and time the query to confirm the reduction in disk I/O, and verify the pragmas took effect as shown below.
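
Both pragmas apply per connection and are not persisted, so it is worth confirming them at the start of each session:

PRAGMA temp_store;   -- expect 2 (MEMORY)
PRAGMA cache_size;   -- expect -200000 (negative values are in KiB)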

5. Refactor the Query to Use EXISTS Semantics

Eliminate DISTINCT by rewriting the query to check for existence:

WITH dup_candidates AS (
  SELECT hash_id
  FROM object_hashes
  GROUP BY hash_id
  HAVING COUNT(*) > 1
)
SELECT COUNT(*)
FROM (
  SELECT oh.item_id, oh.cntr_id, oh.hash_id
  FROM object_hashes oh
  WHERE EXISTS (
    SELECT 1
    FROM dup_candidates dc
    WHERE dc.hash_id = oh.hash_id
  )
  GROUP BY oh.item_id, oh.cntr_id, oh.hash_id
);

Advantages:

  • Avoids DISTINCT by leveraging GROUP BY with a covering index.
  • Uses EXISTS for efficient membership testing.

6. Update Statistics for Accurate Query Planning

Outdated statistics can lead to poor index selection. Refresh statistics using:

ANALYZE;

Post-Analysis:

  • Check the sqlite_stat1 table to ensure the query planner has up-to-date cardinality estimates; a quick check is shown below.
  • For large tables, consider a build compiled with SQLITE_ENABLE_STAT4, which lets ANALYZE collect histogram samples (stored in sqlite_stat4) for finer-grained selectivity estimates.
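
sqlite_stat1 is readable like any other table; ANALYZE stores one row per (table, index) pair:

-- The stat column lists the estimated row count followed by the average
-- number of rows sharing each index-column prefix
SELECT tbl, idx, stat FROM sqlite_stat1 WHERE tbl = 'object_hashes';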

7. Benchmark Index Scans vs. Table Scans

For large datasets, a full table scan may outperform index lookups when the query touches a large fraction of the table. Force a table scan to compare:

-- COUNT(DISTINCT ...) accepts only one argument, so deduplicate in a subquery;
-- NOT INDEXED forces the full table scan
SELECT COUNT(*)
FROM (
  SELECT DISTINCT item_id, cntr_id, hash_id
  FROM object_hashes NOT INDEXED
);

Interpretation:

  • If the table scan is faster, consider dropping redundant indexes to reduce overhead.
  • Use covering indexes only when they provide a measurable speedup; the shell’s .timer command below makes this easy to check.
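
A convenient way to run the comparison is the sqlite3 shell’s built-in timer, which reports real, user, and system time per statement:

.timer on
-- Forced full table scan
SELECT COUNT(*) FROM (SELECT DISTINCT item_id, cntr_id, hash_id FROM object_hashes NOT INDEXED);
-- Planner's choice (normally the covering index, once it exists)
SELECT COUNT(*) FROM (SELECT DISTINCT item_id, cntr_id, hash_id FROM object_hashes);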

8. Utilize Window Functions for Deduplication

Replace DISTINCT with a window function, shifting the deduplication from a DISTINCT B-tree to a partition sort:

WITH ranked_hashes AS (
  SELECT
    item_id,
    cntr_id,
    hash_id,
    ROW_NUMBER() OVER (PARTITION BY item_id, cntr_id, hash_id ORDER BY 1) AS rn
  FROM dupe_hashes
)
SELECT COUNT(*)
FROM ranked_hashes
WHERE rn = 1;

Trade-offs:

  • The window function still has to sort rows into partitions, so temporary storage may not be avoided; whether it beats the DISTINCT B-tree depends on the plan.
  • Requires testing to validate performance gains.

9. Profile I/O and CPU Utilization

Use SQLite’s profiling tools to identify bottlenecks:

# Monitor I/O
strace -e trace=file sqlite3 database.db < query.sql

# Profile CPU
valgrind --tool=callgrind sqlite3 database.db < query.sql

Actions:

  • If I/O dominates, prioritize covering indexes or increase cache size.
  • If CPU is saturated, simplify query logic or reduce sorting steps.

10. Consider Application-Level Deduplication

If all else fails, delegate deduplication to the application layer:

-- Fetch non-distinct results (dupe_hashes stands for the CTE/query built earlier)
SELECT item_id, cntr_id, hash_id FROM dupe_hashes;

# Deduplicate in code (Python example)
import sqlite3

conn = sqlite3.connect("database.db")
rows = conn.execute("SELECT item_id, cntr_id, hash_id FROM dupe_hashes").fetchall()
unique_rows = {(r[0], r[1], r[2]) for r in rows}
print(len(unique_rows))

Use Case:

  • Effective when fetching the non-deduplicated rows is cheap compared to deduplicating them inside SQLite.
  • Allows leveraging in-memory structures (e.g., Python sets) for faster deduplication.

By systematically addressing index coverage, materialization strategies, and query planner interactions, the performance degradation caused by DISTINCT can be mitigated. The optimal solution often involves a combination of indexing adjustments, query refactoring, and configuration tuning tailored to the specific data distribution and access patterns.
