Slow Cross-Database Joins in SQLite: Diagnosis and Optimization
Understanding Performance Discrepancies in Attached Database Joins
The core challenge is a significant performance degradation when joining tables across two attached SQLite databases. A query joining first_db.lookups with first_db.content_index executes in 0.5 seconds, while a structurally similar query joining first_db.lookups with second_db.page_contents (residing in an attached database) takes 20 seconds or more. The problem is exacerbated when selecting all columns (*), which pushes execution time to 200 seconds. The query planner unexpectedly performs a full table scan on second_db.page_contents despite the presence of a primary key on content_id. This discrepancy raises questions about SQLite's handling of attached databases, index utilization, and join order optimization.
Key observations include:
- Schema design: first_db.lookups has an explicit index on content_id, while second_db.page_contents declares content_id as its INTEGER PRIMARY KEY (implicitly keyed). Both tables should theoretically support indexed joins.
- Query plan variation: the fast query uses an index scan on lookups followed by a primary key search on content_index. The slow query reverses this logic, scanning page_contents and probing lookups via its index.
- Impact of CROSS JOIN syntax: forcing a CROSS JOIN instead of an implicit INNER JOIN alters the query plan to match the efficient case, resolving the performance issue.
- Data characteristics: second_db.page_contents contains large BLOBs, though this alone does not explain the planner's preference for a full scan.
This issue highlights SQLite’s query planner heuristics, particularly its sensitivity to schema metadata, table locality, and join ordering when working with attached databases. The absence of explicit documentation on attached database optimizations compounds the problem, requiring deeper analysis of planner behavior.
Why SQLite’s Query Planner Misoptimizes Cross-Database Joins
The root cause lies in SQLite’s cost-based query planner making suboptimal decisions when joining tables across attached databases. Three primary factors contribute to this behavior:
- Statistics and schema metadata isolation. SQLite maintains separate metadata and statistics (via sqlite_stat1 tables) for each attached database. When joining tables from different databases, the planner lacks unified statistics to accurately estimate join selectivity or row counts. In this scenario, second_db.page_contents has no sqlite_stat1 entries (unless ANALYZE was explicitly run on it), causing the planner to fall back on hard-coded heuristics. These heuristics assume smaller tables are better to scan, and miscalculate when table sizes or index distributions are not accurately represented.
- Index recognition across databases. While second_db.page_contents.content_id is an INTEGER PRIMARY KEY (and thus usable for keyed lookups), the planner may fail to exploit it for the join if the attached database's schema is not fully loaded into the planner's working memory. This is particularly true when the main database is in-memory (:memory:), as schema introspection for attached databases occurs lazily during query preparation. If the planner does not detect the primary key during initial planning, it defaults to a full scan.
- Join order and temporary storage overhead. The GROUP BY content_id clause forces SQLite to build a temporary B-tree to aggregate results. When joining a large table (lookups, with 20M rows) to a smaller but BLOB-heavy table (page_contents, with 2M rows), the planner's choice to scan page_contents first triggers repeated index probes on lookups, multiplying I/O operations. Additionally, materializing BLOB values during the join (e.g., when selecting *) amplifies storage and memory pressure, as BLOBs are read entirely even if only content_id is needed.
The CROSS JOIN syntax works as a manual override, forcing the planner to process lookups first (using its covering index) and then probe page_contents via its primary key. This eliminates the full scan and aligns with the efficient query plan observed in the single-database join.
Resolving Cross-Database Join Performance Degradation
Step 1: Enforce Join Order with CROSS JOIN
Modify the query to explicitly use CROSS JOIN, which disables the planner’s reordering of tables:
SELECT content_id
FROM first_db.lookups
CROSS JOIN second_db.page_contents USING (content_id)
GROUP BY content_id
LIMIT 10;
This forces lookups to be the outer loop, leveraging its content_id_idx index. Each content_id from lookups probes page_contents via its primary key, mirroring the efficient query plan.
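The effect of the CROSS JOIN override can be observed directly by comparing query plans. The sketch below uses Python's sqlite3 module with small in-memory stand-ins for the two database files (the table shapes, index name, and row counts are synthetic, not the original 20M/2M data):

```python
# Rough sketch using Python's sqlite3 module; the tables, index name, and
# row counts are synthetic stand-ins for the originals.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("ATTACH ':memory:' AS first_db")
con.execute("ATTACH ':memory:' AS second_db")
con.execute("CREATE TABLE first_db.lookups (content_id INTEGER, extra TEXT)")
con.execute("CREATE INDEX first_db.content_id_idx ON lookups(content_id)")
con.execute("CREATE TABLE second_db.page_contents"
            " (content_id INTEGER PRIMARY KEY, body BLOB)")
con.executemany("INSERT INTO first_db.lookups VALUES (?, 'x')",
                [(i,) for i in range(100)])
con.executemany("INSERT INTO second_db.page_contents VALUES (?, zeroblob(16))",
                [(i,) for i in range(100)])

def plan(sql):
    # Column 3 of EXPLAIN QUERY PLAN output is the human-readable detail.
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

tail = (" second_db.page_contents USING (content_id)"
        " GROUP BY content_id LIMIT 10")
implicit = plan("SELECT content_id FROM first_db.lookups JOIN" + tail)
forced = plan("SELECT content_id FROM first_db.lookups CROSS JOIN" + tail)
print("implicit:", implicit)
print("forced:  ", forced)  # lookups scanned first, page_contents probed by key
```

In the forced plan, the lookups row always precedes the page_contents row, confirming that CROSS JOIN pins the nesting order.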
Step 2: Validate and Update Statistics with ANALYZE
Run ANALYZE on both databases to ensure the planner has accurate statistics:
ATTACH 'first.db' AS first_db;
ATTACH 'second.db' AS second_db;
ANALYZE first_db;
ANALYZE second_db;
This populates or updates sqlite_stat1 tables, enabling the planner to make informed decisions about join order and index usage.
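The same steps can be scripted from application code. This sketch (synthetic tables, Python's sqlite3 module) runs ANALYZE on each schema and reads back the per-database statistics tables the planner consults:

```python
# Sketch: ANALYZE each attached schema, then inspect its sqlite_stat1 table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("ATTACH ':memory:' AS second_db")
con.execute("CREATE TABLE main.lookups (content_id INTEGER)")
con.execute("CREATE INDEX content_id_idx ON lookups(content_id)")
con.execute("CREATE TABLE second_db.page_contents (content_id INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO lookups VALUES (?)", [(i % 50,) for i in range(500)])
con.executemany("INSERT INTO second_db.page_contents VALUES (?)",
                [(i,) for i in range(50)])

con.execute("ANALYZE main")
con.execute("ANALYZE second_db")

# Each schema keeps its own statistics table; there is no unified view.
main_stats = list(con.execute("SELECT tbl, idx, stat FROM main.sqlite_stat1"))
second_stats = list(con.execute("SELECT tbl, idx, stat FROM second_db.sqlite_stat1"))
print(main_stats)
print(second_stats)
```

The separate sqlite_stat1 tables are precisely why statistics must be gathered in each attached database individually.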
Step 3: Verify Index Utilization
Confirm that second_db.page_contents supports keyed lookups. Note that an INTEGER PRIMARY KEY is an alias for the rowid and does not create a separate index, so listing indices from the sqlite3 shell may return nothing:
-- For second_db
.indices page_contents
An empty result is expected for an INTEGER PRIMARY KEY table; the meaningful check is that EXPLAIN QUERY PLAN reports a SEARCH using the INTEGER PRIMARY KEY rather than a SCAN. If the column was declared with a type other than exactly INTEGER (which prevents the rowid alias), recreate the table with proper constraints.
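Since .indices is a shell dot-command, application code can use PRAGMA index_list instead. A minimal sketch showing why the index list is empty for an INTEGER PRIMARY KEY table while keyed searches still occur:

```python
# Sketch: an INTEGER PRIMARY KEY is the rowid alias, so it does not appear
# in PRAGMA index_list, yet lookups by it are keyed searches, not scans.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE page_contents (content_id INTEGER PRIMARY KEY, body BLOB)")
con.execute("INSERT INTO page_contents VALUES (1, x'00')")

indexes = list(con.execute("PRAGMA index_list(page_contents)"))
print(indexes)  # empty: the rowid alias is not a separate index

plan = [row[3] for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM page_contents WHERE content_id = 1")]
print(plan)  # a SEARCH using the INTEGER PRIMARY KEY
```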
Step 4: Minimize BLOB Overhead in Joins
Avoid selecting unnecessary columns during joins. The original query’s performance degrades further when selecting * due to BLOB materialization. Restrict the result set to only required columns early in the query:
SELECT content_id
FROM (
SELECT content_id FROM first_db.lookups
) AS l
JOIN (
SELECT content_id FROM second_db.page_contents
) AS p USING (content_id)
GROUP BY content_id
LIMIT 10;
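A related pattern that sidesteps BLOB materialization entirely is a two-phase fetch: resolve the limited id set first, then pull BLOBs only for those ids. A sketch with synthetic data (table shapes are assumptions):

```python
# Sketch: fetch the matching ids first, then read BLOBs only for the
# small LIMITed set instead of materializing every BLOB during the join.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE lookups (content_id INTEGER)")
con.execute("CREATE TABLE page_contents (content_id INTEGER PRIMARY KEY, body BLOB)")
con.executemany("INSERT INTO lookups VALUES (?)", [(i,) for i in range(1000)])
con.executemany("INSERT INTO page_contents VALUES (?, zeroblob(4096))",
                [(i,) for i in range(1000)])

# Phase 1: ids only -- no BLOB columns touched.
ids = [r[0] for r in con.execute(
    "SELECT content_id FROM lookups JOIN page_contents USING (content_id) "
    "GROUP BY content_id LIMIT 10")]

# Phase 2: BLOBs for exactly the ten surviving ids.
placeholders = ",".join("?" * len(ids))
rows = con.execute(
    f"SELECT content_id, body FROM page_contents "
    f"WHERE content_id IN ({placeholders})", ids).fetchall()
print(len(rows))
```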
Step 5: Evaluate Database Consolidation
If cross-database joins remain problematic, consolidate the tables into a single database. Attached databases introduce overhead in schema management and keep planner statistics separated. Test performance after merging:
-- In a new combined database
CREATE TABLE lookups (...);
CREATE TABLE page_contents (...);
Avoid WITHOUT ROWID for page_contents: WITHOUT ROWID tables perform poorly when rows contain large values such as multi-kilobyte BLOBs, and keeping content_id as an INTEGER PRIMARY KEY rowid alias preserves the fast keyed lookups relied on above.
Step 6: Monitor Temporary Storage Configuration
A USE TEMP B-TREE FOR GROUP BY line in the query plan indicates heavy temporary storage usage. Configure SQLite's temp storage to use memory instead of disk:
PRAGMA temp_store = MEMORY;
Adjust this pragma based on available RAM, as large B-trees may exhaust memory.
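The setting can be verified programmatically; 2 is the documented value for MEMORY:

```python
# Sketch: set temp_store to MEMORY and read the setting back.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA temp_store = MEMORY")
mode = con.execute("PRAGMA temp_store").fetchone()[0]
print(mode)  # 2 == MEMORY (0 = default, 1 = file, 2 = memory)
```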
Step 7: Profile I/O and Cache Behavior
Use tools like sqlite3_analyzer (which reports per-table and per-index space usage) together with EXPLAIN QUERY PLAN to identify bottlenecks. Page size can also affect I/O for BLOB-heavy tables:
PRAGMA page_size = 4096; -- Match OS page size for optimal I/O
Note that page_size only takes effect on a newly created database or after a VACUUM.
Step 8: Consider Application-Level Caching
For repetitive queries, cache content_id mappings in the application layer to reduce cross-database join frequency.
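A minimal sketch of such a cache, using functools.lru_cache as the memoization layer (the access pattern, cache size, and helper name are assumptions for illustration):

```python
# Sketch: memoize the per-content_id lookup so repeated requests skip
# the cross-database join entirely.
import sqlite3
from functools import lru_cache

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE page_contents (content_id INTEGER PRIMARY KEY, body BLOB)")
con.executemany("INSERT INTO page_contents VALUES (?, x'ab')",
                [(i,) for i in range(10)])

@lru_cache(maxsize=4096)  # hypothetical helper; size tuned to the workload
def body_for(content_id: int) -> bytes:
    row = con.execute("SELECT body FROM page_contents WHERE content_id = ?",
                      (content_id,)).fetchone()
    return row[0] if row else b""

body_for(1)
body_for(1)  # second call is served from the cache
print(body_for.cache_info().hits)  # 1
```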
Step 9: Upgrade SQLite Version
Ensure the latest SQLite version is used, as query planner improvements are introduced regularly; the release notes for each version document planner-related changes, so compare behavior before and after upgrading.
Step 10: Use Subqueries or CTEs for Early Filtering
Push LIMIT and WHERE clauses into subqueries to reduce the working data set before joining. Be aware that limiting the lookups side changes the semantics: without an ORDER BY, the ten ids chosen are arbitrary, and the join may return fewer than ten groups if some of them have no match. Use this pattern only when any matching ids will do:
WITH limited_lookups AS (
SELECT content_id FROM first_db.lookups LIMIT 10
)
SELECT content_id
FROM limited_lookups
JOIN second_db.page_contents USING (content_id)
GROUP BY content_id;
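A runnable version of this pattern, with synthetic tables and an ORDER BY added so the ten ids are deterministic:

```python
# Sketch: limit the lookups side inside a CTE before joining across
# the attached database.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("ATTACH ':memory:' AS second_db")
con.execute("CREATE TABLE lookups (content_id INTEGER)")
con.execute("CREATE TABLE second_db.page_contents (content_id INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO lookups VALUES (?)", [(i,) for i in range(100)])
con.executemany("INSERT INTO second_db.page_contents VALUES (?)",
                [(i,) for i in range(100)])

rows = con.execute("""
    WITH limited_lookups AS (
        SELECT content_id FROM lookups ORDER BY content_id LIMIT 10
    )
    SELECT content_id
    FROM limited_lookups
    JOIN second_db.page_contents USING (content_id)
    GROUP BY content_id
""").fetchall()
print(len(rows))
```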
By systematically applying these steps, developers can mitigate the performance pitfalls of cross-database joins in SQLite, ensuring optimal index usage and planner decisions.