Performance Discrepancy in Window Functions: SQLite vs. DuckDB Analysis


Understanding the Window Function Performance Gap Between SQLite and DuckDB

The performance difference observed when executing window functions in SQLite versus DuckDB stems from fundamental architectural and operational distinctions between the two database systems. In the given scenario, a query leveraging the LAG() window function to compare sequential exam scores for 1,000,000 students runs 23.5 times slower in SQLite (2.774 seconds) compared to DuckDB (0.118 seconds). This gap is not merely a reflection of code efficiency but rather a combination of design philosophies, concurrency models, and query execution strategies.

Context of the Test Case

The test involves four steps:

  1. Creating a table data with columns student_id, exam_id, and score.
  2. Inserting 2 million rows (1,000,000 students × 2 exams) with initial scores of 100.
  3. Updating three students’ second exam scores to 50.
  4. Identifying students with a score drop exceeding 30 points using a window function.

The critical query structure is:

WITH compared AS (
  SELECT
    student_id,
    score,
    LAG(score) OVER (PARTITION BY student_id ORDER BY exam_id) AS prev_score
  FROM data
)
SELECT student_id
FROM compared
WHERE prev_score - score > 30;

While both databases return the correct result (students 1, 500,000, and 1,000,000), the execution time disparity highlights underlying differences in how SQLite and DuckDB process window functions, manage memory, and utilize hardware resources.
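The setup and query above can be reproduced at a small scale with Python’s built-in sqlite3 module (a sketch: table and column names follow the article, but row counts are shrunk to 1,000 students for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (student_id INT, exam_id INT, score INT)")
con.execute("CREATE INDEX idx_data ON data (student_id, exam_id)")

# 1,000 students x 2 exams, all scores initially 100 (scaled down from 1,000,000)
rows = [(s, e, 100) for s in range(1, 1001) for e in (1, 2)]
con.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

# Drop three students' second-exam scores to 50
con.execute(
    "UPDATE data SET score = 50 WHERE exam_id = 2 AND student_id IN (1, 500, 1000)"
)

query = """
WITH compared AS (
  SELECT student_id, score,
         LAG(score) OVER (PARTITION BY student_id ORDER BY exam_id) AS prev_score
  FROM data
)
SELECT student_id FROM compared WHERE prev_score - score > 30
"""
print(sorted(r[0] for r in con.execute(query)))  # -> [1, 500, 1000]
```

Timing this at the full 2-million-row scale (e.g. with `time.perf_counter()`) reproduces the gap the article describes.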


Root Causes of Execution Time Discrepancies in Analytical Queries

1. Concurrency Model and Parallel Execution

DuckDB is designed for analytical workloads and employs a multi-threaded execution engine. It parallelizes operations such as data scanning, window function computation, and predicate evaluation across available CPU cores. In the test, DuckDB’s runtime metrics (real 0.118s, user 0.765s) show a user-to-real time ratio of roughly 6.5:1, meaning the work was spread across several CPU cores running concurrently. This parallelism allows DuckDB to divide the dataset into chunks, process them simultaneously, and merge results efficiently.

SQLite, conversely, operates in single-threaded mode as a transactional OLTP (Online Transaction Processing) database. All operations—index traversal, window function evaluation, and filtering—are executed sequentially on a single CPU core. The runtime metrics (real 2.774s, user 2.765s) confirm near-100% utilization of one core with minimal overhead. While this design ensures ACID compliance and simplicity, it inherently limits performance on compute-intensive analytical queries.

2. Window Function Implementation and Optimization

DuckDB’s window function execution leverages vectorized processing and block-based algorithms. Vectorization processes batches of rows in a single operation, reducing loop overhead and improving cache locality. The LAG() function is optimized to access prior rows within the same partition without rescanning the entire dataset. DuckDB also employs lazy materialization, delaying the computation of window function results until absolutely necessary, thereby minimizing memory consumption.

SQLite’s window function implementation is row-based and non-vectorized. Each row’s LAG() value is computed individually, requiring repeated scans of the partitioned data. Additionally, SQLite materializes the entire result of the compared CTE (Common Table Expression) in memory or temporary storage before applying the WHERE clause. For large datasets, this leads to excessive memory pressure and CPU cache inefficiencies.

3. Index Utilization and Query Planning

The test’s index idx_data (student_id, exam_id) matches the window function’s PARTITION BY student_id ORDER BY exam_id clause. DuckDB’s query planner exploits this ordering and pushes predicates down where possible, avoiding unnecessary data movement. SQLite’s planner is less aggressive here: depending on the plan it chooses, it may fall back to a full table scan followed by an in-memory sort to satisfy the partitioning and ordering requirements, costing redundant I/O and CPU cycles. Inspecting EXPLAIN QUERY PLAN output reveals which strategy was actually used.

4. Temporary Storage and Memory Management

The PRAGMA temp_store = memory directive in SQLite instructs it to use RAM for temporary objects. However, SQLite’s page-based memory allocator and lack of memory pooling lead to fragmentation when handling large intermediate results. DuckDB, designed for in-memory analytics, uses a columnar storage format and memory pools tailored for analytical operations, reducing allocation overhead and improving data density.

5. Compiler Optimizations and Runtime Overhead

DuckDB avoids per-row interpretation overhead through vectorized execution: its operators are pre-compiled, type-specialized functions that process batches of values at a time and can benefit from CPU-specific optimizations such as SIMD instructions (DuckDB deliberately does not use JIT query compilation). SQLite relies on a bytecode interpreter (the VDBE) for query execution, introducing per-operation overhead that compounds in complex queries like window functions.


Optimization Strategies and Engine Selection for Window Function Workloads

1. Query Rewriting for SQLite: Avoiding Window Functions

Given SQLite’s limitations with window functions, rewriting the query to use aggregation and conditional logic can yield significant speedups. Keith Medcalf’s alternative approach demonstrates this:

WITH scores AS (
  SELECT
    student_id,
    SUM(CASE WHEN exam_id = 1 THEN score ELSE 0 END) AS exam1,
    SUM(CASE WHEN exam_id = 2 THEN score ELSE 0 END) AS exam2
  FROM data
  GROUP BY student_id
)
SELECT student_id
FROM scores
WHERE exam1 - exam2 > 30;

This query reduces runtime from 2.774s to 0.719s (3.85x faster) by:

  • Pivoting exam scores into columns using SUM(CASE...).
  • Leveraging the index for efficient GROUP BY execution.
  • Eliminating the need for window functions and their associated sorting/memory overhead.
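The equivalence of the rewrite can be checked quickly with Python’s sqlite3 module on a scaled-down dataset (a sketch; names follow the article):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (student_id INT, exam_id INT, score INT)")
con.executemany("INSERT INTO data VALUES (?, ?, ?)",
                [(s, e, 100) for s in range(1, 101) for e in (1, 2)])
con.execute(
    "UPDATE data SET score = 50 WHERE exam_id = 2 AND student_id IN (1, 50, 100)"
)

# Pivot the two exams into columns, then compare them directly
pivot = """
WITH scores AS (
  SELECT student_id,
         SUM(CASE WHEN exam_id = 1 THEN score ELSE 0 END) AS exam1,
         SUM(CASE WHEN exam_id = 2 THEN score ELSE 0 END) AS exam2
  FROM data GROUP BY student_id
)
SELECT student_id FROM scores WHERE exam1 - exam2 > 30
"""
print(sorted(r[0] for r in con.execute(pivot)))  # -> [1, 50, 100]
```

The GROUP BY walks the (student_id, exam_id) index in order, so no extra sort is needed.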

2. Schema Denormalization for Frequent Comparisons

If comparing sequential values (e.g., exam scores) is a common operation, denormalizing the schema to store adjacent values in the same row can improve performance:

CREATE TABLE exams (
  student_id INT PRIMARY KEY,
  exam1_score INT,
  exam2_score INT
);

Queries then reduce to simple arithmetic:

SELECT student_id
FROM exams
WHERE exam1_score - exam2_score > 30;

This eliminates partitioning/sorting entirely and allows index-only scans.
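A minimal sketch of the denormalized variant, again using Python’s sqlite3 module (the three dropped students are chosen arbitrarily for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE exams (
      student_id INT PRIMARY KEY,
      exam1_score INT,
      exam2_score INT
    )
""")
# One row per student; exam scores live side by side in the same row
con.executemany("INSERT INTO exams VALUES (?, ?, ?)",
                [(s, 100, 50 if s in (1, 50, 100) else 100)
                 for s in range(1, 101)])

dropped = [r[0] for r in con.execute(
    "SELECT student_id FROM exams WHERE exam1_score - exam2_score > 30")]
print(sorted(dropped))  # -> [1, 50, 100]
```

The comparison is now a single-pass scan with per-row arithmetic: no partitioning, no sorting, no window state.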

3. Controlled Use of CTEs and Materialization

SQLite’s handling of CTEs can lead to unintended materialization. Rewriting the query as a derived table (a plain subquery in the FROM clause) gives the query planner more freedom to inline it, which may reduce temporary storage usage:

SELECT student_id
FROM (
  SELECT
    student_id,
    LAG(score) OVER (PARTITION BY student_id ORDER BY exam_id) - score AS diff
  FROM data
) 
WHERE diff > 30;

However, benchmark both forms: whether inlining or materialization wins depends on the dataset and the SQLite version in use.
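The derived-table form can be checked for correctness the same way (a sketch with Python’s sqlite3 module on a small dataset; note the difference is computed as previous score minus current score, matching the original filter):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (student_id INT, exam_id INT, score INT)")
con.executemany("INSERT INTO data VALUES (?, ?, ?)",
                [(s, e, 100) for s in range(1, 101) for e in (1, 2)])
con.execute(
    "UPDATE data SET score = 50 WHERE exam_id = 2 AND student_id IN (1, 50, 100)"
)

subquery = """
SELECT student_id FROM (
  SELECT student_id,
         LAG(score) OVER (PARTITION BY student_id ORDER BY exam_id) - score AS diff
  FROM data
) WHERE diff > 30
"""
print(sorted(r[0] for r in con.execute(subquery)))  # -> [1, 50, 100]
```

Rows where LAG() returns NULL (each student’s first exam) compare as NULL > 30, which is not true, so they are filtered out automatically.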

4. Selective Indexing and Covering Indexes

Creating a covering index that includes all columns referenced in the query allows SQLite to satisfy the query via an index-only scan:

CREATE INDEX idx_data_covering ON data (student_id, exam_id, score);

This reduces I/O by avoiding table heap accesses.
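Whether SQLite actually performs an index-only scan can be verified from Python by inspecting the EXPLAIN QUERY PLAN output (a sketch; the point lookup below is just a simple query known to be fully answerable from the index):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (student_id INT, exam_id INT, score INT)")
con.execute("CREATE INDEX idx_data_covering ON data (student_id, exam_id, score)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT score FROM data WHERE student_id = 1 AND exam_id = 2"
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)  # plan text names idx_data_covering as a COVERING INDEX
```

If the plan says COVERING INDEX, the table heap is never touched; a plain SEARCH ... USING INDEX still has to fetch score from the table rows.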

5. Database Engine Selection Guidelines

  • Choose SQLite for:
    • Embedded applications with transactional workloads.
    • Low-concurrency environments (single writer, multiple readers).
    • Scenarios requiring ACID guarantees and minimal setup.
  • Choose DuckDB for:
    • Analytical queries with complex aggregations/window functions.
    • Read-heavy workloads benefiting from multi-core parallelism.
    • In-memory data processing and OLAP (Online Analytical Processing).

6. Profiling and Diagnostics

  • SQLite’s EXPLAIN QUERY PLAN:
    EXPLAIN QUERY PLAN
    WITH compared AS (...)
    SELECT student_id FROM compared WHERE ...;
    

    Look for SCAN TABLE data (full scan) versus SEARCH TABLE data USING INDEX.

  • DuckDB’s EXPLAIN ANALYZE:
    EXPLAIN ANALYZE
    WITH compared AS (...)
    SELECT student_id FROM compared WHERE ...;
    

    Inspect the per-operator timings and thread counts in the profile output to confirm that the scan and window operators actually ran across multiple threads.

7. Configuration Tuning

  • SQLite:
    • Increase cache_size (PRAGMA cache_size = -1000000; ~1GB) to reduce page faults.
    • Use PRAGMA threads = 4; (if compiled with SQLITE_MAX_WORKER_THREADS > 0) to allow auxiliary worker threads for large sorts, including index builds.
  • DuckDB:
    • Adjust threads setting (SET threads TO 8;) to match CPU core count.
    • Use PRAGMA memory_limit='4GB'; to control memory usage.
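On the SQLite side, these pragmas are applied per connection and can be read back to confirm they took effect (a sketch; the cache size value is the article’s ~1 GB example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA temp_store = MEMORY")    # keep temporary objects in RAM
con.execute("PRAGMA cache_size = -1000000")  # negative means size in KiB: ~1 GB

# Read the settings back: temp_store reports 2 for MEMORY
cache = con.execute("PRAGMA cache_size").fetchone()[0]
temp = con.execute("PRAGMA temp_store").fetchone()[0]
print(cache, temp)  # -> -1000000 2
```

Note that a negative cache_size is a cap, not an up-front allocation; SQLite grows the page cache on demand.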

8. Alternative Libraries and Extensions

  • SQLite Window Function Support:
    Window functions have been built into SQLite since version 3.25.0 (2018); ensure the build in use is recent enough and was not compiled with the SQLITE_OMIT_WINDOWFUNC option, which removes them.
  • DuckDB’s SQLite Extension:
    DuckDB can attach SQLite database files directly through its sqlite extension, enabling hybrid workloads where transactional data stays in SQLite and analytical queries run in DuckDB:

    INSTALL sqlite;
    LOAD sqlite;
    ATTACH 'test.db' AS sqlite_db (TYPE sqlite);
    CREATE TABLE duck_data AS FROM sqlite_db.data;
    -- Execute analytical queries on duck_data
    

9. Hardware-Aware Optimization

  • RAM Allocation: Ensure sufficient memory for both databases to avoid disk-based temporary storage.
  • CPU Affinity: Pin DuckDB worker threads to specific cores (via OS tools) to reduce context-switching overhead.

By understanding the architectural trade-offs and applying targeted optimizations, developers can mitigate SQLite’s window function performance limitations or transition to DuckDB when analytical throughput is critical. The choice ultimately hinges on workload characteristics, concurrency requirements, and operational constraints.
