Optimizing SQLite Query Performance on Large Tables with Composite Indexes
SQLite Query Performance Degradation with 33 Million Rows
When dealing with large datasets in SQLite, performance optimization becomes critical, especially for queries that need to return results in a timely manner. In this scenario, we have a table named `item` containing 33 million rows, with a schema that includes columns such as `id`, `date_time`, `list_id`, `item_id`, `modified_date_time`, `title`, and `json`. The primary query in question filters rows using a `WHERE` clause with conditions on `list_id`, `item_id`, and `modified_date_time`. The query currently takes approximately 150.7ms to execute, which is too slow for the application’s requirements. The goal is to reduce this query time to under 70ms.
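For concreteness, here is a plausible reconstruction of the table and the query. Only the column names come from the problem description; the types are assumptions (the discussion below notes that `list_id` is a `VARCHAR` and `modified_date_time` a `TIMESTAMP`), and the parameter values are placeholders:

```sql
-- Hypothetical reconstruction of the item table. Only the column
-- names are given in the article; the types shown are assumptions.
CREATE TABLE item (
    id                 INTEGER PRIMARY KEY,
    date_time          TIMESTAMP,
    list_id            VARCHAR(50),
    item_id            VARCHAR(50),
    modified_date_time TIMESTAMP,
    title              TEXT,
    json               TEXT
);

-- The slow query: highly selective, expected to return one row.
SELECT * FROM item
WHERE list_id = ? AND item_id = ? AND modified_date_time = ?;
```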
The table has an existing index on `(list_id, modified_date_time)`, but query performance remains suboptimal. Several attempts have been made to improve it, including experimenting with different indexes and splitting the data into separate tables based on `list_id`. However, these changes have not yielded the desired performance improvements. The issue is further complicated by the fact that the query returns only a single row, indicating that the filtering conditions are highly selective.
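That existing index corresponds to a definition along these lines (the index name is hypothetical):

```sql
-- Existing two-column index; the name is made up for illustration.
CREATE INDEX idx_item_list_modified
    ON item (list_id, modified_date_time);
```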
Inefficient Index Usage and Data Distribution
The root cause of the performance issue lies in the inefficiency of the current indexing strategy and the distribution of data within the table. The existing index on `(list_id, modified_date_time)` is not fully utilized by the query, which also filters on `item_id`. As a result, a significant number of rows are scanned unnecessarily, increasing query execution time.
The distribution of values in the columns involved in the query also plays a crucial role in performance. If the values of `list_id`, `item_id`, and `modified_date_time` are not evenly distributed, the query optimizer may struggle to choose the most efficient execution plan. For example, if many rows share the same `list_id` and `modified_date_time` but only a few have the specific `item_id`, the query may end up scanning a large number of rows before finding the correct one.
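One way to gauge this skew, assuming the schema sketched above, is to count how many rows share each `(list_id, modified_date_time)` pair; large counts mean the existing index leaves many candidate rows to sift through:

```sql
-- Rows per (list_id, modified_date_time) pair: the bigger the count,
-- the more rows the existing index forces the query to examine.
SELECT list_id, modified_date_time, COUNT(*) AS n
FROM item
GROUP BY list_id, modified_date_time
ORDER BY n DESC
LIMIT 10;
```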
Additionally, the `ANALYZE` command has never been run on the database, so the query optimizer lacks statistical information about the data distribution. This can lead to suboptimal query plans being chosen, further exacerbating the performance issue.
Implementing Composite Indexes and Analyzing Data Distribution
To address the performance issue, the following steps should be taken:
1. **Create a Composite Index:** A composite index on `(list_id, item_id, modified_date_time)` should be created (see the sketch after this list). This index allows the query to filter efficiently on all three columns, reducing the number of rows that need to be scanned. The order of the columns in the index is important: `list_id` should come first, followed by `item_id`, and then `modified_date_time`. This order aligns with the filtering conditions in the query and ensures that the index is used effectively.

2. **Run the ANALYZE Command:** The `ANALYZE` command should be executed to gather statistical information about the data distribution in the table. This information helps the query optimizer make better decisions when choosing an execution plan:

   ```sql
   ANALYZE;
   ```

3. **Review the Query Plan:** After creating the composite index and running `ANALYZE`, the query plan should be reviewed to ensure that the new index is being used. This can be done by prepending `EXPLAIN QUERY PLAN` to the query:

   ```sql
   EXPLAIN QUERY PLAN
   SELECT * FROM item
   WHERE list_id = ? AND item_id = ? AND modified_date_time = ?;
   ```

   The output should indicate that the composite index is being used for the search.

4. **Optimize Data Types:** The data types of the columns involved in the query should be optimized. For example, `list_id` is currently stored as a `VARCHAR`, but it could be stored as a `BLOB` to reduce storage size and improve comparison performance. Similarly, `modified_date_time` should be stored as an appropriate numeric type rather than a `TIMESTAMP` to ensure efficient comparisons (a migration sketch follows this list).

5. **Consider Data Partitioning:** If the data distribution is highly skewed, consider partitioning the data into separate tables based on `list_id` (also sketched after this list). This can reduce the number of rows that need to be scanned for each query. However, this approach should be used with caution, as it complicates the schema and increases the complexity of queries.

6. **Monitor and Adjust:** After implementing the above changes, monitor the query performance and adjust the indexing strategy as needed (see the timing example after this list). If the query performance is still not meeting the requirements, consider further optimizing the schema or exploring other performance tuning techniques.
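The composite index from step 1 might be created as follows (the index name is hypothetical); if no other query depends on the old `(list_id, modified_date_time)` index, it can then be dropped to save space:

```sql
-- Composite index covering all three equality conditions, in the
-- same order as the WHERE clause: list_id, item_id, modified_date_time.
CREATE INDEX idx_item_list_item_modified
    ON item (list_id, item_id, modified_date_time);

-- Optional: drop the old index if nothing else uses it.
DROP INDEX idx_item_list_modified;
```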
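A migration sketch for step 4, assuming `modified_date_time` is converted to Unix epoch seconds via `strftime('%s', ...)` and that the stored timestamps are in a format `strftime` can parse; the new table name and column types are illustrative:

```sql
-- Rebuild with numeric date storage and a BLOB key, per step 4.
CREATE TABLE item_new (
    id                 INTEGER PRIMARY KEY,
    date_time          INTEGER,      -- epoch seconds
    list_id            BLOB,         -- BLOB instead of VARCHAR
    item_id            VARCHAR(50),
    modified_date_time INTEGER,      -- epoch seconds instead of TIMESTAMP
    title              TEXT,
    json               TEXT
);

INSERT INTO item_new
SELECT id,
       CAST(strftime('%s', date_time) AS INTEGER),
       CAST(list_id AS BLOB),
       item_id,
       CAST(strftime('%s', modified_date_time) AS INTEGER),
       title,
       json
FROM item;

-- After verifying the copy, drop the old table, rename item_new
-- to item, and recreate the composite index.
```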
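The partitioning idea from step 5 could look like this, with one table per `list_id` value; the table name and the `'abc'` value are placeholders:

```sql
-- One partition table per list; queries for this list then scan
-- only this table and can use a smaller two-column index.
CREATE TABLE item_list_abc AS
SELECT * FROM item WHERE list_id = 'abc';

CREATE INDEX idx_item_list_abc_key
    ON item_list_abc (item_id, modified_date_time);
```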
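For the monitoring in step 6, the sqlite3 command-line shell's `.timer` directive reports per-statement wall-clock time, which makes before/after comparisons against the 70ms target straightforward (the literal values below are placeholders):

```sql
-- In the sqlite3 shell: enable timing, then run the query.
.timer on
SELECT * FROM item
WHERE list_id = 'list-123'
  AND item_id = 'item-456'
  AND modified_date_time = 1700000000;
```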
By following these steps, the query performance on the `item` table should be significantly improved, reducing the execution time to under 70ms. The key is to ensure that the query optimizer has the necessary information and tools to choose the most efficient execution plan, and that the data is stored in a way that facilitates fast retrieval.