Performance Degradation in SQLite 1.0.117 with Recursive CTE and System.Data.SQLite
Issue Overview: Performance Regression in Recursive CTE Query with System.Data.SQLite 1.0.117
The core issue revolves around a significant performance degradation observed when executing a recursive Common Table Expression (CTE) query using System.Data.SQLite version 1.0.117 compared to version 1.0.116. The query in question involves a recursive CTE that traverses a hierarchical structure stored in the tbl_Keywords
table, joined with a filtered subset of data from the tbl_LastUsedKeywords
table. In version 1.0.116, the query executes in approximately 20 milliseconds, whereas in version 1.0.117, the execution time balloons to around 2000 milliseconds. This performance regression is particularly perplexing because the query plans generated by SQLite versions 3.38.5 (used in 1.0.116) and 3.40 (used in 1.0.117) are identical, and the command-line versions of these SQLite engines perform equally well. The issue appears to be isolated to the System.Data.SQLite library in version 1.0.117.
The query itself is designed to retrieve a hierarchical path of keywords from tbl_Keywords
, filter them based on recent usage from tbl_LastUsedKeywords
, and exclude specific keywords (Orte
and Personen
) from the results. The query also groups the results by keyword ID, counts occurrences, and limits the output to the top 20 results. The recursive nature of the query, combined with the filtering and grouping operations, makes it sensitive to changes in how SQLite optimizes and executes such queries.
Possible Causes: System.Data.SQLite 1.0.117 and Query Optimization Changes
The performance regression in System.Data.SQLite 1.0.117 can be attributed to several potential causes, each of which warrants careful investigation. The first and most likely cause is a change in how the System.Data.SQLite library interacts with the underlying SQLite engine in version 1.0.117. While the query plans generated by SQLite 3.38.5 and 3.40 are identical, the System.Data.SQLite library may be introducing additional overhead or altering the execution context in a way that negatively impacts performance. This could be due to changes in how the library handles parameter binding, result set retrieval, or memory management.
Another possible cause is the introduction of a breaking change in SQLite 3.40 that affects the optimization of recursive CTEs. Although the query plans are identical, the internal execution logic of recursive CTEs may have been modified in a way that interacts poorly with the System.Data.SQLite library. For example, the pushdown of conditions such as Path not like '%Orte%' and Path not like '%Personen%'
into the recursive branches of the CTE may no longer be occurring as efficiently as in previous versions. This could result in the recursive step running to completion without pruning intermediate results, leading to a significant increase in execution time.
Additionally, the absence of an index on the Parent
column of the tbl_Keywords
table could exacerbate the performance issues. While this was not a problem in version 1.0.116, the changes in version 1.0.117 may have made the lack of this index more impactful. Without an index on the Parent
column, the recursive join operation in the CTE may be performing full table scans, which would be particularly costly in a recursive context.
Finally, the use of the DATETIME
type in the tbl_LastUsedKeywords
table, which is not a native SQLite type, could be contributing to the performance degradation. SQLite stores dates as text, and the implicit conversion between text and datetime values may be handled differently in version 1.0.117, leading to additional overhead.
Troubleshooting Steps, Solutions & Fixes: Diagnosing and Resolving the Performance Regression
To diagnose and resolve the performance regression in System.Data.SQLite 1.0.117, a systematic approach is required. The first step is to confirm that the issue is indeed isolated to the System.Data.SQLite library and not a broader problem with SQLite 3.40. This can be done by executing the query using the command-line versions of SQLite 3.38.5 and 3.40 and comparing the execution times. If the command-line versions perform equally well, as indicated in the discussion, then the issue is likely specific to the System.Data.SQLite library.
Next, the query should be analyzed using the EXPLAIN QUERY PLAN
and EXPLAIN
commands to gain insight into how SQLite is executing the query. While the query plans may appear identical, subtle differences in execution logic may be revealed by examining the detailed output of these commands. Pay particular attention to how conditions such as Path not like '%Orte%' and Path not like '%Personen%'
are being handled in the recursive branches of the CTE. If these conditions are not being pushed down into the recursive branches, as suggested by Keith Medcalf, then manually incorporating these conditions into the CTE may improve performance.
If the issue persists, the next step is to examine the System.Data.SQLite library itself. This can be done by profiling the application to identify any bottlenecks introduced by the library. Tools such as Visual Studio’s Performance Profiler can be used to measure the time spent in various parts of the code, including the execution of the query and the retrieval of results. If the profiling results indicate that the library is introducing significant overhead, then the issue may need to be reported to the maintainers of System.Data.SQLite for further investigation.
In the meantime, several workarounds can be implemented to mitigate the performance regression. One approach is to manually push the filtering conditions into the recursive branches of the CTE, as suggested by Keith Medcalf. This can be done by modifying the CTE to include the conditions directly in the recursive step, rather than applying them after the CTE has been fully evaluated. For example:
with recursive HKeywords (ID, Keyword, Parent, IsSynonym, IsIgnored, Level, ItemType, Path) as
(
-- First level
select a.*, Keyword as Path
from tbl_Keywords a
where a.parent is null
union
-- Recursive levels
select tbl_keywords.*, HKeywords.Path || '|' || tbl_Keywords.Keyword as Path
from tbl_Keywords
join HKeywords on HKeywords.id = tbl_Keywords.Parent
where HKeywords.Path not like '%Orte%' and HKeywords.Path not like '%Personen%'
)
-- Main selection
SELECT HKeywords.ID as KeywordID,
Keyword as Subject,
Path as HSubject
FROM HKeywords
INNER JOIN (select KeywordID from tbl_LastUsedKeywords where TimeStamp > date('now', '-100 days')) b
ON HKeywords.ID = b.KeywordID
WHERE HKeywords.ID is not null
GROUP BY HKeywords.ID
HAVING Count(HKeywords.ID) > 1
Order By Count(HKeywords.ID) DESC
Limit 20;
Another potential workaround is to add an index on the Parent
column of the tbl_Keywords
table. This index would improve the performance of the recursive join operation by allowing SQLite to quickly locate the child rows for each parent. The index can be created using the following DDL statement:
CREATE INDEX idx_tbl_Keywords_Parent ON tbl_Keywords(Parent);
Finally, if the issue is determined to be a bug in the System.Data.SQLite library, then downgrading to version 1.0.116 may be necessary until a fix is available. This can be done by uninstalling version 1.0.117 and reinstalling version 1.0.116 from the System.Data.SQLite website. However, this should be considered a temporary solution, as it may not be feasible to remain on an older version indefinitely.
In conclusion, the performance regression in System.Data.SQLite 1.0.117 is a complex issue that requires careful analysis and systematic troubleshooting. By examining the query execution plan, profiling the application, and implementing targeted workarounds, it is possible to mitigate the impact of this regression and restore the query’s performance to acceptable levels.