Slow Cross-Database Joins in SQLite: Diagnosis and Optimization


Understanding Performance Discrepancies in Attached Database Joins

The core challenge is a significant performance gap when joining tables across two attached SQLite databases. A query joining first_db.lookups with first_db.content_index executes in 0.5 seconds, while a structurally similar query joining first_db.lookups with second_db.page_contents (residing in an attached database) takes 20 seconds or more. The problem is exacerbated when all columns are selected (SELECT *), pushing execution time to roughly 200 seconds. The query planner unexpectedly performs a full table scan on second_db.page_contents even though content_id is its primary key. This discrepancy raises questions about SQLite's handling of attached databases, index utilization, and join-order optimization.

Key observations include:

  • Schema Design: first_db.lookups has an explicit index on content_id, while second_db.page_contents declares content_id as its INTEGER PRIMARY KEY (an alias for the rowid, so lookups on it probe the table's own b-tree). Both tables should theoretically support indexed joins.
  • Query Plan Variation: The fast query uses an index scan on lookups followed by a primary key search on content_index. The slow query reverses this logic, scanning page_contents and probing lookups via its index.
  • Impact of CROSS JOIN Syntax: Forcing a CROSS JOIN instead of an implicit INNER JOIN alters the query plan to match the efficient case, resolving the performance issue.
  • Data Characteristics: second_db.page_contents contains large BLOBs, though this alone does not explain the planner’s preference for a full scan.

This issue highlights SQLite’s query planner heuristics, particularly its sensitivity to schema metadata, table locality, and join ordering when working with attached databases. The absence of explicit documentation on attached database optimizations compounds the problem, requiring deeper analysis of planner behavior.
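SQLite's EXPLAIN QUERY PLAN is the primary tool for seeing which side of the join gets scanned. The sketch below uses Python's built-in sqlite3 module with a hypothetical miniature of the schema described above (the table and index names match the article; the data and attachment targets are stand-ins):

```python
import sqlite3

# Hypothetical miniature of the schema described above: two attached
# in-memory databases standing in for first.db and second.db.
con = sqlite3.connect(":memory:")
con.execute("ATTACH ':memory:' AS first_db")
con.execute("ATTACH ':memory:' AS second_db")
con.execute("CREATE TABLE first_db.lookups "
            "(id INTEGER PRIMARY KEY, content_id INTEGER)")
con.execute("CREATE INDEX first_db.content_id_idx ON lookups(content_id)")
con.execute("CREATE TABLE second_db.page_contents "
            "(content_id INTEGER PRIMARY KEY, body BLOB)")

# Each EXPLAIN QUERY PLAN row's last column says SCAN (full table scan)
# or SEARCH (index/rowid probe) for one table in the join.
plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT content_id FROM first_db.lookups "
    "JOIN second_db.page_contents USING (content_id) "
    "GROUP BY content_id LIMIT 10"
).fetchall()
for *_, detail in plan:
    print(detail)
```

Reading which table appears as SCAN versus SEARCH, and in what order, is exactly how the fast and slow plans in this article were distinguished.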


Why SQLite’s Query Planner Misoptimizes Cross-Database Joins

The root cause lies in SQLite’s cost-based query planner making suboptimal decisions when joining tables across attached databases. Three primary factors contribute to this behavior:

  1. Statistics and Schema Metadata Isolation
    SQLite maintains separate metadata and statistics (via sqlite_stat1 tables) for each attached database. When joining tables from different databases, the planner lacks unified statistics to accurately estimate join selectivity or row counts. In this scenario, second_db.page_contents has no sqlite_stat1 entries (unless ANALYZE was explicitly run on it), causing the planner to default to hard-coded heuristics. These heuristics assume smaller tables are better to scan, but miscalculations occur when table sizes or index distributions are not accurately represented.

  2. Index Recognition Across Databases
    While second_db.page_contents.content_id is an INTEGER PRIMARY KEY (and therefore keyed directly by the table's b-tree rather than by a separate index), the planner may still decline to use it for the inner loop of the join. Without statistics, the cost model cannot tell that a rowid probe into page_contents is cheap relative to scanning it, and this uncertainty is compounded when the main database is in-memory (:memory:), since schema introspection for attached databases occurs during query preparation. If the planner underestimates the value of the rowid lookup at that point, it defaults to a full scan.

  3. Join Order and Temporary Storage Overhead
    The GROUP BY content_id clause forces SQLite to build a temporary B-tree to aggregate results. When joining a large table (lookups with 20M rows) to a smaller but BLOB-heavy table (page_contents with 2M rows), the planner’s choice to scan page_contents first triggers repeated index probes on lookups, multiplying I/O operations. Additionally, materializing BLOB values during the join (e.g., when selecting *) amplifies storage and memory pressure, as BLOBs are read entirely even if only content_id is needed.

The CROSS JOIN syntax works as a manual override, forcing the planner to process lookups first (using its covering index) and then probe page_contents via its primary key. This eliminates the full scan and aligns with the efficient query plan observed in the single-database join.


Resolving Cross-Database Join Performance Degradation

Step 1: Enforce Join Order with CROSS JOIN

Modify the query to explicitly use CROSS JOIN, which disables the planner’s reordering of tables:

SELECT content_id
FROM first_db.lookups
CROSS JOIN second_db.page_contents USING (content_id)
GROUP BY content_id
LIMIT 10;

This forces lookups to be the outer loop, leveraging its content_id_idx index. Each content_id from lookups probes page_contents via its primary key, mirroring the efficient query plan.
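A quick way to confirm the effect is to inspect the plan directly. This sketch (Python sqlite3, toy stand-in schema rather than the original databases) checks that with CROSS JOIN the lookups table is the outer loop:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("ATTACH ':memory:' AS first_db")
con.execute("ATTACH ':memory:' AS second_db")
con.execute("CREATE TABLE first_db.lookups "
            "(id INTEGER PRIMARY KEY, content_id INTEGER)")
con.execute("CREATE INDEX first_db.content_id_idx ON lookups(content_id)")
con.execute("CREATE TABLE second_db.page_contents "
            "(content_id INTEGER PRIMARY KEY, body BLOB)")

def plan(sql):
    """Return the detail column of each EXPLAIN QUERY PLAN row."""
    return [r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql)]

cross = plan(
    "SELECT content_id FROM first_db.lookups "
    "CROSS JOIN second_db.page_contents USING (content_id) "
    "GROUP BY content_id LIMIT 10")
print(cross)

# CROSS JOIN pins the left table as the outer loop, so the plan rows
# must mention lookups before page_contents.
order = [d for d in cross if "lookups" in d or "page_contents" in d]
assert "lookups" in order[0]
```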

Step 2: Validate and Update Statistics with ANALYZE

Run ANALYZE on both databases to ensure the planner has accurate statistics:

ATTACH 'first.db' AS first_db;
ATTACH 'second.db' AS second_db;
ANALYZE first_db;
ANALYZE second_db;

This populates or updates sqlite_stat1 tables, enabling the planner to make informed decisions about join order and index usage.
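The effect of ANALYZE is observable: after it runs, each schema gains a sqlite_stat1 table with one row per index. A sketch with Python's sqlite3 module (toy schema and row counts assumed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("ATTACH ':memory:' AS first_db")
con.execute("CREATE TABLE first_db.lookups "
            "(id INTEGER PRIMARY KEY, content_id INTEGER)")
con.execute("CREATE INDEX first_db.content_id_idx ON lookups(content_id)")
con.executemany("INSERT INTO first_db.lookups VALUES (?, ?)",
                [(i, i % 100) for i in range(1000)])

# ANALYZE <schema> gathers statistics for every table in that schema.
con.execute("ANALYZE first_db")

# sqlite_stat1 holds one row per index: (table, index, "rows rows-per-key").
stats = con.execute(
    "SELECT tbl, idx, stat FROM first_db.sqlite_stat1").fetchall()
for row in stats:
    print(row)
```

With these rows present, the planner can estimate how selective content_id_idx is instead of falling back to hard-coded defaults.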

Step 3: Verify Index Utilization

Confirm how second_db.page_contents is keyed. A column declared exactly as INTEGER PRIMARY KEY does not get a separate index; it is an alias for the rowid, so the table's own b-tree is the lookup structure. Listing indexes is still worthwhile, because a near-miss declaration such as INT PRIMARY KEY is not a rowid alias and instead creates a separate unique index:

-- In the sqlite3 shell (".indices" is the older spelling)
.indexes page_contents

-- Or from SQL:
PRAGMA second_db.index_list('page_contents');

An empty index list is expected and harmless for a true INTEGER PRIMARY KEY table. If content_id is not declared that way, recreate the table with the proper constraint.
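The same check can be scripted. The sketch below (Python sqlite3, toy schema assumed) shows that an INTEGER PRIMARY KEY table reports no separate index, while table_info still flags the column as the primary key:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("ATTACH ':memory:' AS second_db")
con.execute("CREATE TABLE second_db.page_contents "
            "(content_id INTEGER PRIMARY KEY, body BLOB)")

# No separate index exists: content_id is an alias for the rowid,
# so the table b-tree itself is the lookup structure.
indexes = con.execute(
    "PRAGMA second_db.index_list('page_contents')").fetchall()
print(indexes)  # []

# table_info still marks content_id as the primary key (pk column = 1):
for cid, name, ctype, notnull, dflt, pk in con.execute(
        "PRAGMA second_db.table_info('page_contents')"):
    print(name, ctype, "PRIMARY KEY" if pk else "")
```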

Step 4: Minimize BLOB Overhead in Joins

Avoid selecting unnecessary columns during joins. SQLite stores large BLOBs on overflow pages and only reads them when the column is actually requested, which is why the original query degrades so sharply under SELECT *: every matched row's BLOB must be materialized. Restrict the select list to the columns that are actually needed:

SELECT l.content_id
FROM first_db.lookups AS l
JOIN second_db.page_contents AS p USING (content_id)
GROUP BY l.content_id
LIMIT 10;

Wrapping each table in a column-pruning subquery, as is sometimes suggested, adds nothing here: SQLite flattens such subqueries into the outer query, so naming only the needed columns in the select list has the same effect.

Step 5: Evaluate Database Consolidation

If cross-database joins remain problematic, consolidate the tables into a single database. Attached databases introduce overhead in schema management and keep statistics siloed per file. Test performance after merging:

-- In a new combined database
CREATE TABLE lookups (...);
CREATE TABLE page_contents (...);

Avoid WITHOUT ROWID for page_contents: a WITHOUT ROWID table stores the entire row, BLOBs included, inside the primary-key b-tree, which slows key-only traversals. A plain rowid table with INTEGER PRIMARY KEY is the better layout for BLOB-heavy data.
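One way to consolidate is an ATTACH followed by INSERT INTO ... SELECT, which copies the attached table into the main database. A sketch (Python sqlite3; in-memory databases stand in for the real files, and the five sample rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stands in for the new combined database
con.execute("ATTACH ':memory:' AS second_db")
con.execute("CREATE TABLE second_db.page_contents "
            "(content_id INTEGER PRIMARY KEY, body BLOB)")
con.executemany("INSERT INTO second_db.page_contents VALUES (?, ?)",
                [(i, b"\x00" * 16) for i in range(5)])

# Copy the table into the main database, then detach the old file.
con.execute("CREATE TABLE main.page_contents "
            "(content_id INTEGER PRIMARY KEY, body BLOB)")
con.execute("INSERT INTO main.page_contents "
            "SELECT content_id, body FROM second_db.page_contents")
con.commit()
con.execute("DETACH second_db")

copied = con.execute("SELECT count(*) FROM page_contents").fetchone()[0]
print(copied)  # 5
```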

Step 6: Monitor Temporary Storage Configuration

A USE TEMP B-TREE FOR GROUP BY line in the query plan indicates that SQLite is building temporary storage for the aggregation. Configure SQLite's temp storage to use memory instead of disk:

PRAGMA temp_store = MEMORY;

Adjust this pragma based on available RAM, as large B-trees may exhaust memory.
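The pragma's effect can be confirmed by reading it back: SQLite reports temp_store as 0 (DEFAULT), 1 (FILE), or 2 (MEMORY). A minimal check with Python's sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA temp_store = MEMORY")

# PRAGMA temp_store reports 0 = DEFAULT, 1 = FILE, 2 = MEMORY
mode = con.execute("PRAGMA temp_store").fetchone()[0]
print(mode)  # 2
```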

Step 7: Profile I/O and Cache Behavior

Use sqlite3_analyzer to see how much space each table and its indexes occupy, and check that the database's page size suits the workload:

PRAGMA page_size;         -- report the current page size
PRAGMA page_size = 4096;  -- match the filesystem block size for efficient I/O
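One caveat: PRAGMA page_size only takes effect on a database that is still empty (or after a VACUUM rebuilds the file). The sketch below (Python sqlite3, a throwaway temp file) demonstrates the required ordering:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.db")
con = sqlite3.connect(path)

# Must be issued before the first table is created; on a non-empty
# database the setting is ignored until a VACUUM.
con.execute("PRAGMA page_size = 8192")
con.execute("CREATE TABLE t (x INTEGER)")
con.commit()

size = con.execute("PRAGMA page_size").fetchone()[0]
print(size)  # 8192
```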

Step 8: Consider Application-Level Caching

For repetitive queries, cache content_id mappings in the application layer to reduce cross-database join frequency.
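A minimal version of such a cache can be sketched with functools.lru_cache (the table and column names follow this article; the helper get_body and its cache size are hypothetical choices):

```python
import sqlite3
from functools import lru_cache

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE page_contents "
            "(content_id INTEGER PRIMARY KEY, body BLOB)")
con.execute("INSERT INTO page_contents VALUES (1, x'DEADBEEF')")

db_hits = 0  # counts how often we actually query SQLite

@lru_cache(maxsize=4096)
def get_body(content_id):
    """Fetch a BLOB by content_id, memoized so repeats skip the database."""
    global db_hits
    db_hits += 1
    row = con.execute("SELECT body FROM page_contents WHERE content_id = ?",
                      (content_id,)).fetchone()
    return row[0] if row else None

for _ in range(3):
    get_body(1)   # only the first call reaches the database
print(db_hits)    # 1
```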

Step 9: Upgrade SQLite Version

Ensure a recent SQLite version is used: query-planner improvements land regularly, and release notes frequently mention better join-order and index-selection heuristics.
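Checking the library version is straightforward from most language bindings; in Python, sqlite3.sqlite_version reports the version of the linked SQLite library (not the version of the Python wrapper module), and the tuple form is convenient for comparisons:

```python
import sqlite3

print(sqlite3.sqlite_version)       # e.g. "3.45.1": the linked SQLite library
print(sqlite3.sqlite_version_info)  # same version, as a comparable tuple

# Gate expectations about planner behavior on the version tuple
# (the 3.40.0 threshold here is an arbitrary illustrative cutoff):
if sqlite3.sqlite_version_info < (3, 40, 0):
    print("consider upgrading SQLite for newer query-planner behavior")
```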

Step 10: Use Subqueries or CTEs for Early Filtering

Push LIMIT and WHERE clauses into subqueries to shrink the working set before joining. Note that moving the LIMIT changes which rows are considered; that is acceptable here only because the original query's LIMIT 10 without an ORDER BY already returns an arbitrary ten groups:

WITH limited_lookups AS (
  SELECT content_id FROM first_db.lookups LIMIT 10
)
SELECT content_id
FROM limited_lookups
JOIN second_db.page_contents USING (content_id)
GROUP BY content_id;

By systematically applying these steps, developers can mitigate the performance pitfalls of cross-database joins in SQLite, ensuring optimal index usage and planner decisions.
