Slow Cross-Database Joins in SQLite: Diagnosis and Optimization
Understanding Performance Discrepancies in Attached Database Joins
The core challenge revolves around a significant performance degradation when joining tables across two attached SQLite databases. A query joining first_db.lookups
with first_db.content_index
executes in 0.5 seconds, while a structurally similar query joining first_db.lookups
with second_db.page_contents
(residing in an attached database) takes 20 seconds or more. The problem exacerbates when selecting all columns (*
), increasing execution time to 200 seconds. The query planner unexpectedly performs a full table scan on second_db.page_contents
despite the presence of a primary key index on content_id
. This discrepancy raises questions about SQLite’s handling of attached databases, index utilization, and join order optimizations.
Key observations include:
- Schema Design:
first_db.lookups
has an explicit index oncontent_id
, whilesecond_db.page_contents
declarescontent_id
as itsINTEGER PRIMARY KEY
(implicitly indexed). Both tables should theoretically leverage indexed joins. - Query Plan Variation: The fast query uses an index scan on
lookups
followed by a primary key search oncontent_index
. The slow query reverses this logic, scanningpage_contents
and probinglookups
via its index. - Impact of
CROSS JOIN
Syntax: Forcing aCROSS JOIN
instead of an implicitINNER JOIN
alters the query plan to match the efficient case, resolving the performance issue. - Data Characteristics:
second_db.page_contents
contains large BLOBs, though this alone does not explain the planner’s preference for a full scan.
This issue highlights SQLite’s query planner heuristics, particularly its sensitivity to schema metadata, table locality, and join ordering when working with attached databases. The absence of explicit documentation on attached database optimizations compounds the problem, requiring deeper analysis of planner behavior.
Why SQLite’s Query Planner Misoptimizes Cross-Database Joins
The root cause lies in SQLite’s cost-based query planner making suboptimal decisions when joining tables across attached databases. Three primary factors contribute to this behavior:
Statistics and Schema Metadata Isolation
SQLite maintains separate metadata and statistics (viasqlite_stat1
tables) for each attached database. When joining tables from different databases, the planner lacks unified statistics to accurately estimate join selectivity or row counts. In this scenario,second_db.page_contents
has nosqlite_stat1
entries (unlessANALYZE
was explicitly run on it), causing the planner to default to hard-coded heuristics. These heuristics assume smaller tables are better to scan, but miscalculations occur when table sizes or index distributions are not accurately represented.Index Recognition Across Databases
Whilesecond_db.page_contents.content_id
is anINTEGER PRIMARY KEY
(and thus indexed), the planner may fail to recognize this index’s availability for joins if the attached database’s schema is not fully loaded into the planner’s working memory. This is particularly true when the main database is in-memory (:memory:
), as schema introspection for attached databases occurs lazily during query preparation. If the planner does not detect the primary key index during initial planning, it defaults to a full scan.Join Order and Temporary Storage Overhead
TheGROUP BY content_id
clause forces SQLite to build a temporary B-tree to aggregate results. When joining a large table (lookups
with 20M rows) to a smaller but BLOB-heavy table (page_contents
with 2M rows), the planner’s choice to scanpage_contents
first triggers repeated index probes onlookups
, multiplying I/O operations. Additionally, materializing BLOB values during the join (e.g., when selecting*
) amplifies storage and memory pressure, as BLOBs are read entirely even if onlycontent_id
is needed.
The CROSS JOIN
syntax works as a manual override, forcing the planner to process lookups
first (using its covering index) and then probe page_contents
via its primary key. This eliminates the full scan and aligns with the efficient query plan observed in the single-database join.
Resolving Cross-Database Join Performance Degradation
Step 1: Enforce Join Order with CROSS JOIN
Modify the query to explicitly use CROSS JOIN
, which disables the planner’s reordering of tables:
SELECT content_id
FROM first_db.lookups
CROSS JOIN second_db.page_contents USING (content_id)
GROUP BY content_id
LIMIT 10;
This forces lookups
to be the outer loop, leveraging its content_id_idx
index. Each content_id
from lookups
probes page_contents
via its primary key, mirroring the efficient query plan.
Step 2: Validate and Update Statistics with ANALYZE
Run ANALYZE
on both databases to ensure the planner has accurate statistics:
ATTACH 'first.db' AS first_db;
ATTACH 'second.db' AS second_db;
ANALYZE first_db;
ANALYZE second_db;
This populates or updates sqlite_stat1
tables, enabling the planner to make informed decisions about join order and index usage.
Step 3: Verify Index Utilization
Confirm that second_db.page_contents
has a valid primary key index. While INTEGER PRIMARY KEY
automatically creates an index, explicit verification is prudent:
-- For second_db
.indices page_contents
If no index exists (unlikely), recreate the table with proper constraints.
Step 4: Minimize BLOB Overhead in Joins
Avoid selecting unnecessary columns during joins. The original query’s performance degrades further when selecting *
due to BLOB materialization. Restrict the result set to only required columns early in the query:
SELECT content_id
FROM (
SELECT content_id FROM first_db.lookups
) AS l
JOIN (
SELECT content_id FROM second_db.page_contents
) AS p USING (content_id)
GROUP BY content_id
LIMIT 10;
Step 5: Evaluate Database Consolidation
If cross-database joins remain problematic, consolidate tables into a single database. Attached databases introduce overhead in schema management and statistics aggregation. Test performance after merging:
-- In a new combined database
CREATE TABLE lookups (...) WITHOUT ROWID;
CREATE TABLE page_contents (...) WITHOUT ROWID;
Step 6: Monitor Temporary Storage Configuration
The USE TEMP B-TREE FOR GROUP BY
in the query plan indicates heavy temporary storage usage. Configure SQLite’s temp storage to leverage memory instead of disk:
PRAGMA temp_store = MEMORY;
Adjust this pragma based on available RAM, as large B-trees may exhaust memory.
Step 7: Profile I/O and Cache Behavior
Use tools like sqlite3_analyzer
or enable SQLite’s I/O profiling to identify bottlenecks:
PRAGMA cell_size_check = ON;
PRAGMA page_size = 4096; -- Match OS page size for optimal I/O
Step 8: Consider Application-Level Caching
For repetitive queries, cache content_id
mappings in the application layer to reduce cross-database join frequency.
Step 9: Upgrade SQLite Version
Ensure the latest SQLite version is used, as optimizer improvements are regularly introduced. For instance, version 3.45.0 enhanced join reordering heuristics.
Step 10: Use Subqueries or CTEs for Early Filtering
Push LIMIT
and WHERE
clauses into subqueries to reduce the working data set before joining:
WITH limited_lookups AS (
SELECT content_id FROM first_db.lookups LIMIT 10
)
SELECT content_id
FROM limited_lookups
JOIN second_db.page_contents USING (content_id)
GROUP BY content_id;
By systematically applying these steps, developers can mitigate the performance pitfalls of cross-database joins in SQLite, ensuring optimal index usage and planner decisions.