Optimizing Geopoly Queries with FTS5 in SQLite
Geopoly and FTS5 Performance Discrepancy in Combined Queries
The core issue revolves around the significant performance discrepancy observed when combining geopoly_within
spatial queries with FTS5
full-text search queries in SQLite. While both geopoly
and FTS5
perform reasonably well in isolation, their combination results in a drastic slowdown, making the query execution time unacceptable for practical use. This issue is particularly evident when attempting to perform a spatial search within a polygon (using geopoly_within
) alongside a full-text search (using FTS5
), where the query execution time increases by orders of magnitude compared to using either feature independently.
The schema in question involves three main tables: treatments
, materialsCitations
, and two virtual tables (vtreatments
for full-text search and vlocations
for spatial indexing). The vlocations
table is populated with tiny polygons constructed around latitude and longitude points from the materialsCitations
table. The problem arises when joining these tables to perform a combined spatial and full-text search, where the query execution time balloons to over 46 seconds, compared to sub-second execution times for individual FTS5
or geopoly
queries.
Inefficient Join Operations and Virtual Table Indexing
The primary cause of the performance degradation lies in the inefficiency of join operations between the virtual tables (vtreatments
and vlocations
) and the main tables (treatments
and materialsCitations
). When SQLite executes the combined query, it performs a full scan of the vlocations
table, which is inherently slower due to the computational complexity of spatial operations. Additionally, the lack of optimal indexing strategies for virtual tables exacerbates the issue, as SQLite cannot leverage efficient index-based joins between the virtual and physical tables.
The geopoly
extension, which relies on R-tree indexing for spatial queries, is not as performant as FTS5
for full-text searches. This is because spatial queries involve more complex geometric calculations, such as determining whether a point lies within a polygon, which is computationally intensive compared to the token-based matching used in FTS5
. Furthermore, the geopoly
extension does not provide the same level of optimization for join operations as FTS5
, leading to slower query execution when combining spatial and full-text searches.
Another contributing factor is the way SQLite handles virtual tables. Virtual tables, such as those created with FTS5
and geopoly
, do not support the same indexing mechanisms as regular tables. This limitation forces SQLite to resort to full scans or less efficient indexing strategies when joining virtual tables with physical tables, resulting in slower query performance. The lack of direct support for compound indexes that span both virtual and physical tables further compounds the issue, making it difficult to optimize queries that involve both spatial and full-text search criteria.
Optimizing Geopoly Queries with Intersect and Alternative Approaches
To address the performance issues, several strategies can be employed to optimize the combined use of geopoly
and FTS5
in SQLite. One effective approach is to use the INTERSECT
operator to combine the results of separate FTS5
and geopoly
queries, rather than joining the tables directly. This method leverages the strengths of each virtual table independently and then combines the results, reducing the computational overhead associated with join operations.
For example, the following query uses the INTERSECT
operator to combine the results of a full-text search with a spatial search:
SELECT Count(*)
FROM treatments
WHERE treatmentId IN (
SELECT treatmentid
FROM vtreatments
WHERE vtreatments MATCH 'meshram'
INTERSECT
SELECT treatmentid
FROM vlocations
WHERE geopoly_within(_shape, poly) != 0
);
This query structure allows SQLite to execute the FTS5
and geopoly
queries separately and then combine the results using the INTERSECT
operator, which is more efficient than performing a direct join between the virtual tables. The execution time for this query is significantly faster, at around 1.8 seconds, compared to the original 46-second query.
Another optimization strategy is to use bounding box (bbox) queries instead of geopoly_within
for spatial searches. Bbox queries are simpler and faster because they involve straightforward comparisons of latitude and longitude values, rather than complex geometric calculations. For example, the following query uses a bbox to perform a spatial search:
SELECT Count(*)
FROM treatments
JOIN materialsCitations ON treatments.treatmentId = materialsCitations.treatmentId
JOIN vtreatments ON treatments.treatmentId = vtreatments.treatmentId
WHERE vtreatments MATCH 'meshram'
AND latitude BETWEEN min_lat AND max_lat
AND longitude BETWEEN min_lng AND max_lng;
This query executes in just a few milliseconds, making it a viable alternative to geopoly_within
for many use cases. However, it is important to note that bbox queries are less precise than geopoly_within
, as they only approximate the search area with a rectangle rather than a polygon.
For scenarios where the precision of geopoly_within
is required, consider using the R-tree
module directly instead of the geopoly
extension. The R-tree
module provides more control over spatial indexing and can be more efficient for certain types of spatial queries. Additionally, the R-tree
module supports auxiliary columns, which can be used to optimize queries that involve both spatial and non-spatial criteria.
Finally, if the performance of geopoly
and R-tree
queries remains insufficient for your needs, consider using a specialized spatial database such as PostGIS. While this option introduces additional complexity, it provides significantly better performance for spatial queries and supports a wider range of spatial operations. However, for many applications, the optimizations described above should be sufficient to achieve acceptable performance without the need to migrate to a different database system.
In conclusion, the performance issues associated with combining geopoly
and FTS5
queries in SQLite can be mitigated through careful query design and optimization. By leveraging the INTERSECT
operator, using bbox queries, and exploring alternative indexing strategies, it is possible to achieve significant improvements in query execution time while maintaining the precision and functionality required for spatial and full-text searches.