Optimizing SQLite Query Performance on Large Tables with Composite Indexes

SQLite Query Performance Degradation with 33 Million Rows

When working with large datasets in SQLite, query optimization becomes critical for any query that must return quickly. In this scenario, a table named item contains 33 million rows, with a schema that includes the columns id, date_time, list_id, item_id, modified_date_time, title, and json. The query in question filters rows with a WHERE clause on list_id, item_id, and modified_date_time. It currently takes approximately 150.7ms to execute, which is too slow for the application's requirements; the goal is to bring the query time under 70ms.
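
Based on that description, the schema and query look roughly like the following; the column types are assumptions, since the full DDL is not given:

    -- Assumed shape of the table; the real column types may differ.
    CREATE TABLE item (
        id INTEGER PRIMARY KEY,
        date_time TEXT,
        list_id VARCHAR,
        item_id VARCHAR,
        modified_date_time TIMESTAMP,
        title TEXT,
        json TEXT
    );

    -- The query being tuned: three equality predicates, one matching row.
    SELECT * FROM item WHERE list_id=? AND item_id=? AND modified_date_time=?;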

The table has an existing index on (list_id, modified_date_time), but query performance remains suboptimal. Several attempts have been made to improve it, including experimenting with different indexes and splitting the data into separate tables by list_id, yet none has produced the desired speedup. Notably, the query returns only a single row, which means the filtering conditions are highly selective; with the right index, such a lookup should be very fast.
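
For reference, the existing index corresponds to a statement like this (the index name is hypothetical):

    CREATE INDEX idx_item_list_modified ON item (list_id, modified_date_time);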

Inefficient Index Usage and Data Distribution

The root cause of the performance issue lies in the current indexing strategy and the distribution of data within the table. The existing index on (list_id, modified_date_time) only partially covers the query's filter: because item_id is not part of the index, SQLite can narrow the search using the other two columns but must then fetch each remaining candidate row from the table to test the item_id condition. This can mean a significant number of unnecessary row fetches, increasing query execution time.

The data distribution of the columns involved in the query also plays a crucial role in performance. If the values of list_id, item_id, and modified_date_time are not evenly distributed, the query optimizer may struggle to choose the most efficient execution plan. For example, if there are many rows with the same list_id and modified_date_time but only a few with the specific item_id, the query may end up scanning a large number of rows before finding the correct one.
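
One quick way to gauge that skew is to count rows per list_id directly; a minimal sketch:

    -- Large counts here mean the two-column index can still leave many
    -- candidate rows for SQLite to fetch and test against item_id.
    SELECT list_id, COUNT(*) AS rows_per_list
    FROM item
    GROUP BY list_id
    ORDER BY rows_per_list DESC
    LIMIT 10;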

Additionally, ANALYZE has never been run on this database, so the query optimizer lacks statistical information about the data distribution. Without those statistics, SQLite may choose a suboptimal query plan, further exacerbating the performance issue.
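
Whether ANALYZE has ever been run can be checked by looking for the sqlite_stat1 table that it populates:

    -- An error ("no such table") or an empty result means the optimizer
    -- has no statistics for the item table.
    SELECT * FROM sqlite_stat1 WHERE tbl = 'item';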

Implementing Composite Indexes and Analyzing Data Distribution

To address the performance issue, the following steps should be taken:

  1. Create a Composite Index: A composite index on (list_id, item_id, modified_date_time) should be created. This index will allow the query to efficiently filter rows based on all three columns, reducing the number of rows that need to be scanned. The order of the columns in the index is important; list_id should be the first column, followed by item_id, and then modified_date_time. This order aligns with the filtering conditions in the query and ensures that the index is used effectively.
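
    For example, with a hypothetical index name:

    CREATE INDEX idx_item_lookup
        ON item (list_id, item_id, modified_date_time);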

  2. Run the ANALYZE Command: The ANALYZE command should be executed to gather statistical information about the data distribution in the table. This information will help the query optimizer make better decisions when choosing an execution plan. The ANALYZE command can be run as follows:

    ANALYZE;
    
  3. Review the Query Plan: After creating the composite index and running ANALYZE, the query plan should be reviewed to ensure that the new index is being used. This can be done by prepending EXPLAIN QUERY PLAN to the query:

    EXPLAIN QUERY PLAN
    SELECT * FROM item WHERE list_id=? AND item_id=? AND modified_date_time=?;
    

    The output should indicate that the composite index is being used for the search.
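
    With the composite index in place, the plan should read roughly as follows (the exact wording varies by SQLite version, and idx_item_lookup is the hypothetical name from step 1):

    SEARCH item USING INDEX idx_item_lookup (list_id=? AND item_id=? AND modified_date_time=?)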

  4. Optimize Data Types: The data types of the columns involved in the query should also be reviewed. For example, list_id is currently stored as a VARCHAR; if its values are UUID-like strings, storing them as compact BLOBs reduces storage size and turns comparisons into cheap byte-wise operations. Similarly, modified_date_time should be stored as a numeric type, such as a Unix epoch integer, rather than a textual TIMESTAMP, so that filtering involves simple integer comparisons. Smaller keys also mean more of the index fits in the page cache.
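
    A sketch of what that could look like, assuming the identifiers are UUIDs that can be packed into 16-byte blobs (the table name and types are illustrative):

    -- Illustrative alternative schema with compact, cheap-to-compare types.
    CREATE TABLE item_compact (
        id INTEGER PRIMARY KEY,
        date_time INTEGER,             -- Unix epoch seconds
        list_id BLOB NOT NULL,         -- 16-byte binary UUID
        item_id BLOB NOT NULL,         -- 16-byte binary UUID
        modified_date_time INTEGER,    -- Unix epoch seconds or milliseconds
        title TEXT,
        json TEXT
    );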

  5. Consider Data Partitioning: If the data distribution is highly skewed, partitioning the data into separate tables by list_id can reduce the number of rows each query has to consider. Note, however, that a split by list_id was already attempted here without the desired improvement, and partitioning complicates both the schema and the queries against it, so it should be a last resort. If revisited, each partition needs its own index on the remaining filter columns, as in the sketch below.
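
    A minimal sketch of one partition (the table name, index name, and list_id value are placeholders):

    CREATE TABLE item_list_42 AS
        SELECT * FROM item WHERE list_id = '42';
    CREATE INDEX idx_item_list_42 ON item_list_42 (item_id, modified_date_time);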

  6. Monitor and Adjust: After implementing the above changes, monitor the query performance and adjust the indexing strategy as needed. If the query performance is still not meeting the requirements, consider further optimizing the schema or exploring other performance tuning techniques.
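
    In the sqlite3 command-line shell, per-statement timing can be switched on with the built-in .timer directive, which makes before/after comparisons straightforward (the literal values below are placeholders for real parameters):

    .timer on
    SELECT * FROM item
    WHERE list_id = 'example-list'
      AND item_id = 'example-item'
      AND modified_date_time = '2024-01-01 12:00:00';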

By following these steps, query performance on the item table should improve significantly and bring the execution time under the 70ms target. The key is to give the query optimizer the statistics and indexes it needs to choose an efficient execution plan, and to store the data in a form that supports fast retrieval.
