Segfaults When Using FTS5 Tokenizer v2 API After Closing Connection
Understanding FTS5 Tokenizer API Lifetime Changes and Segmentation Faults
FTS5 Tokenizer API Version Differences and Memory Management
The core issue revolves around behavioral changes between the FTS5 tokenizer v1 and v2 APIs in SQLite, specifically regarding object lifetime management and memory ownership. When using the v2 API, attempts to free tokenizer objects after closing their parent database connection result in segmentation faults, while the same pattern works correctly with v1. This discrepancy stems from fundamental architectural differences in how these API versions handle tokenizer object ownership and lifecycle management.
In the v1 API (fts5_tokenizer), developers supply their own struct instance that SQLite populates with function pointers. The caller retains full ownership of this memory, allowing tokenizer instances to persist beyond connection closure without conflict. The v2 API (fts5_tokenizer_v2) reverses this relationship: SQLite provides a pre-allocated struct containing both the tokenizer implementation and internal state. This structure becomes invalid when the parent database connection closes, as FTS5 automatically cleans up associated resources. Attempting to use or free these now-invalid pointers leads to undefined behavior, including crashes.
The critical distinction lies in ownership semantics:
- v1: Caller-owned tokenizer struct with SQLite-populated function pointers
- v2: SQLite-owned tokenizer struct containing implementation details
This architectural shift introduces strict lifetime coupling between database connections and tokenizer instances in v2. While v1 allowed tokenizers to outlive their originating connections (provided developers managed memory correctly), v2 intrinsically links tokenizer validity to connection lifetime. The segmentation fault occurs because closing the connection triggers FTS5’s cleanup routines, which deallocate the tokenizer struct. Subsequent attempts to interact with the tokenizer (including explicit deletion) access freed memory.
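The ownership difference can be modelled without SQLite at all. The following is a minimal Python sketch, not FTS5 code: FakeConnection, find_v1, and find_v2 are hypothetical stand-ins that mimic a caller-owned copy (v1 style) versus a connection-owned struct (v2 style) that is invalidated on close.

```python
class TokenizerStruct:
    """Stand-in for a struct of tokenizer function pointers."""

    def __init__(self, name):
        self.name = name
        self.valid = True

    def tokenize(self, text):
        if not self.valid:
            # In C this would be a use-after-free, not a clean exception.
            raise RuntimeError("tokenizer struct used after connection close")
        return text.lower().split()


class FakeConnection:
    """Hypothetical connection that owns its registered tokenizer structs."""

    def __init__(self):
        self._registry = {"unicode61": TokenizerStruct("unicode61")}

    def find_v1(self, name):
        # v1 style: fill in a caller-owned copy of the struct.
        return TokenizerStruct(self._registry[name].name)

    def find_v2(self, name):
        # v2 style: hand out a reference to the connection-owned struct.
        return self._registry[name]

    def close(self):
        # Closing "frees" everything the connection owns.
        for struct in self._registry.values():
            struct.valid = False
        self._registry.clear()


conn = FakeConnection()
v1_tok = conn.find_v1("unicode61")
v2_tok = conn.find_v2("unicode61")
conn.close()

print(v1_tok.tokenize("Hello World"))  # the caller-owned copy still works
try:
    v2_tok.tokenize("Hello World")
except RuntimeError as exc:
    print("v2:", exc)
```

The model raises a Python exception where real code would touch freed memory, which is why the actual failure shows up as a segmentation fault rather than a catchable error.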
Root Causes of Post-Connection Closure Tokenizer Invalidations
Three primary factors contribute to this behavioral divergence:
Versioned Object Handling in v2 API
The fts5_tokenizer_v2 structure contains version-specific fields that may expand over time. By returning SQLite-owned structs, FTS5 ensures binary compatibility across versions. However, this design transfers memory management responsibility to the database engine, creating implicit lifetime dependencies. When connections close, SQLite immediately reclaims these version-sensitive structures rather than waiting for explicit developer cleanup.
Absence of Reference Counting
Unlike some SQLite APIs that employ reference counting for shared resources, FTS5 tokenizers maintain a direct owner-relationship with their parent connection. The v2 API provides no mechanism to extend tokenizer lifetime beyond connection closure, even if outstanding references exist. This contrasts with v1’s caller-owned model, where developers could implement custom reference tracking.
Callback Function Ownership
The v2 API bundles tokenizer implementation details (xCreate, xTokenize, xDelete) within SQLite-managed memory. When the connection closes:
- Tokenizer structs containing these function pointers get deallocated
- Any copies of the pointers become dangling references
- Subsequent calls through these pointers (even if captured earlier) access invalid memory
This explains why simply copying the xDelete/xTokenize function pointers from v2 structs doesn’t prevent crashes: the implementations behind these pointers may rely on connection-specific context that gets destroyed during closure.
Resolving Tokenizer Lifetime Conflicts in FTS5 v2 Implementations
Immediate Mitigation Strategies
Connection Lifetime Alignment
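Restructure code to ensure tokenizer objects never outlive their parent connections. In garbage-collected environments like Python, close the connection explicitly at the end of the scope (in APSW, using the connection itself as a context manager wraps a transaction and does not close it):
    import contextlib
    import apsw

    with contextlib.closing(apsw.Connection(":memory:")) as conn:
        tokenizer = conn.fts5_tokenizer("unicode61")
        # Use tokenizer within connection scope
    # Connection is closed here; the tokenizer must not be used afterwards
Manually nullify tokenizer references before closing connections in non-GC contexts.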
API Version Selection
Continue using the v1 API (fts5_tokenizer) when long-lived tokenizers are necessary. The v1 pattern remains valid and safe when properly managed:
    fts5_tokenizer tok;
    void *pUserData = NULL;
    int rc = pApi->xFindTokenizer(pApi, "unicode61", &pUserData, &tok);
    /* On SQLITE_OK, tok is a caller-owned copy that remains usable after connection closure */
Function Pointer Isolation
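Extract and persist v2 tokenizer functions independently of their container struct. The attribute names below assume a binding that exposes the raw struct fields:
    import apsw

    class TokenizerWrapper:
        def __init__(self, v2_struct):
            # Copy the function pointers while the connection is still open
            self.xCreate = v2_struct.xCreate
            self.xTokenize = v2_struct.xTokenize
            self.xDelete = v2_struct.xDelete

    conn = apsw.Connection(":memory:")
    raw_tok = conn.fts5_tokenizer("unicode61")
    wrapped_tok = TokenizerWrapper(raw_tok)
    conn.close()
    # Use wrapped_tok.xDelete() cautiously
This captures the necessary functions while decoupling from SQLite’s memory management, though proper cleanup timing remains crucial: the copy must be made before the connection closes, and the copied functions are only safe to call if their implementation does not depend on connection state.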
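Copying a struct by value, rather than keeping a pointer into library-owned memory, can be illustrated with ctypes. This is a sketch under stated assumptions: TokenizerFuncs is a hypothetical layout (a version field followed by three function-pointer slots), the addresses are made up, and the memset merely simulates the library reclaiming its struct.

```python
import ctypes


class TokenizerFuncs(ctypes.Structure):
    """Hypothetical layout: a version field plus three function-pointer slots."""

    _fields_ = [
        ("iVersion", ctypes.c_int),
        ("xCreate", ctypes.c_void_p),
        ("xDelete", ctypes.c_void_p),
        ("xTokenize", ctypes.c_void_p),
    ]


def snapshot(struct_ptr):
    """Copy the pointed-to struct by value into memory owned by Python."""
    return TokenizerFuncs.from_buffer_copy(struct_ptr.contents)


# Simulate the library-owned struct with made-up addresses.
library_owned = TokenizerFuncs(2, 0x1000, 0x2000, 0x3000)
borrowed = ctypes.pointer(library_owned)

# The snapshot must be taken while the "connection" is still open.
owned_copy = snapshot(borrowed)

# Simulate the library wiping its struct on close.
ctypes.memset(ctypes.addressof(library_owned), 0, ctypes.sizeof(library_owned))

print(hex(owned_copy.xDelete))   # value preserved in the private copy
print(library_owned.xDelete)     # the original slot has been cleared
```

The copy only protects the struct's own storage. Whether the copied function pointers are still safe to call after close depends on the tokenizer implementation.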
Long-Term Architectural Solutions
Connection Closure Hooks
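Implement connection-specific cleanup routines that invalidate associated tokenizers. SQLite does not offer a public close-hook API, so this sketch routes every close through an application-defined wrapper (close_with_tokenizer is a hypothetical name):
    typedef struct {
        fts5_tokenizer_v2 *tok;  /* borrowed from FTS5; valid only while db is open */
        sqlite3 *db;
    } TokenizerContext;

    /* Application-level replacement for calling sqlite3_close() directly */
    int close_with_tokenizer(TokenizerContext *ctx) {
        int rc;
        ctx->tok = NULL;  /* Invalidate before SQLite cleanup */
        rc = sqlite3_close(ctx->db);
        if (rc == SQLITE_OK) ctx->db = NULL;
        return rc;
    }

    /* When creating tokenizer: */
    TokenizerContext *ctx = sqlite3_malloc(sizeof(*ctx));
    if (ctx) {
        void *pUserData = NULL;
        ctx->db = db;
        ctx->tok = NULL;
        pApi->xFindTokenizer_v2(pApi, "unicode61", &pUserData, &ctx->tok);
    }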
Custom Reference Counting
Wrap v2 tokenizers in reference-counted containers that coordinate with connection state. The xDelete method is assumed to be exposed by the binding, and the tokenizer must be released before the connection is closed:
    import weakref
    import apsw

    class SafeTokenizer:
        def __init__(self, conn, name):
            self._conn_ref = weakref.ref(conn)
            self._tok = conn.fts5_tokenizer(name)
            self._active = True

        def close(self):
            # Only touch the tokenizer while the connection object is still alive
            if self._active and self._conn_ref() is not None:
                self._tok.xDelete()
            self._active = False

        def __del__(self):
            self.close()

    # Usage: manual cleanup before the connection closes and before GC
    conn = apsw.Connection(":memory:")
    tok = SafeTokenizer(conn, "unicode61")
    tok.close()
    conn.close()
API Version Detection and Adaptation
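Implement compile-time and runtime checks to select optimal API usage based on availability. The v2 tokenizer API is available from SQLite 3.47.0, and at runtime the fts5_api iVersion field must be at least 3 before the v2 methods are read:
    #if SQLITE_VERSION_NUMBER >= 3047000
    #define USE_FTS5_V2 1
    #else
    #define USE_FTS5_V2 0
    #endif

    void setup_tokenizer(fts5_api *pApi, const char *name) {
        void *pUserData = NULL;
    #if USE_FTS5_V2
        fts5_tokenizer_v2 *tok = NULL;
        if (pApi->iVersion < 3
         || pApi->xFindTokenizer_v2(pApi, name, &pUserData, &tok)) {
            /* Handle error */
        }
        /* tok is owned by FTS5: track connection lifetime */
    #else
        fts5_tokenizer tok;
        if (pApi->xFindTokenizer(pApi, name, &pUserData, &tok)) {
            /* Handle error */
        }
        /* Caller manages lifetime */
    #endif
    }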
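From Python, a comparable check can be made against the version of the SQLite library that the standard sqlite3 module is linked with. This is a rough sketch, assuming the v2 tokenizer API first shipped in SQLite 3.47.0; supports_tokenizer_v2 and lifetime_policy are illustrative names, and a binding such as APSW may link a different SQLite build than the standard module does.

```python
import sqlite3

# Assumption: the v2 tokenizer API first shipped in SQLite 3.47.0.
FTS5_V2_MIN_VERSION = (3, 47, 0)


def supports_tokenizer_v2(version_info=None):
    """Return True if the given (or linked) SQLite version is new enough."""
    if version_info is None:
        version_info = sqlite3.sqlite_version_info
    return tuple(version_info) >= FTS5_V2_MIN_VERSION


def lifetime_policy(version_info=None):
    """Pick the lifetime rule a wrapper layer should enforce."""
    if supports_tokenizer_v2(version_info):
        return "connection-scoped"  # v2: the struct dies with the connection
    return "caller-owned"           # v1: the caller keeps its own copy


print(sqlite3.sqlite_version, lifetime_policy())
```

Passing an explicit tuple, for example lifetime_policy((3, 46, 1)), makes the policy testable without depending on the locally installed library.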
Deep Dive: SQLite Internal Mechanics
Understanding FTS5’s module registration helps explain the lifetime constraints. When using xCreateTokenizer, SQLite stores tokenizer implementations in a connection-specific registry. The v2 API returns pointers to these registered implementations, which get destroyed during sqlite3_close_v2:
Connection Closure Sequence
- Invoke sqlite3_close()
- Call sqlite3LeaveMutexAndCloseZombie()
- Trigger sqlite3VtabModuleUnref() for each registered module
- Execute fts5ModuleDestroy() for FTS5 modules
- Free tokenizer structures via sqlite3_free()
Tokenization Process Flow
- xFindTokenizer_v2 retrieves module from connection registry
- Tokenizer struct contains pointers to current connection’s resources
- Post-closure, these pointers reference deallocated memory
Memory Sanitizer Diagnostics
Address sanitizers detect invalid accesses through:
- Use-after-free: Accessing tokenizer struct post-closure
- Heap-use-after-free: Calling xDelete/xTokenize after parent struct deallocation
- Invalid pointer dereference: Using stale function pointers
Best Practices for Stable Tokenizer Management
Lifetime Documentation
Explicitly document tokenizer-connection relationships:
"FTS5 tokenizers obtained via xFindTokenizer_v2 remain valid only while their parent database connection exists. Applications must ensure all tokenizer references are discarded before connection closure."
Automated Lifetime Tracking
Integrate tokenizer management with connection objects:
    import apsw

    class ManagedConnection(apsw.Connection):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._tokenizers = []

        def fts5_tokenizer(self, name):
            tok = super().fts5_tokenizer(name)
            self._tokenizers.append(tok)
            return tok

        def close(self, force=False):
            for tok in self._tokenizers:
                # Optional based on API version; assumes the binding exposes xDelete
                tok.xDelete()
            self._tokenizers.clear()
            super().close(force)
Cross-Version Compatibility Layers
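Create abstraction layers that normalize API differences. FTS5_V2_AVAILABLE is an application-defined macro indicating that the headers declare the v2 API:
    typedef struct {
        int version;
        void *pUserData;
        union {
            fts5_tokenizer v1;      /* caller-owned copy */
    #ifdef FTS5_V2_AVAILABLE
            fts5_tokenizer_v2 *v2;  /* owned by FTS5; valid while the connection is open */
    #endif
        };
    } UnifiedTokenizer;

    int get_tokenizer(fts5_api *api, const char *name, UnifiedTokenizer *out) {
    #ifdef FTS5_V2_AVAILABLE
        if (api->iVersion >= 3) {
            out->version = 2;
            return api->xFindTokenizer_v2(api, name, &out->pUserData, &out->v2);
        }
    #endif
        out->version = 1;
        return api->xFindTokenizer(api, name, &out->pUserData, &out->v1);
    }

    typedef int (*TokenCallback)(void *, int, const char *, int, int, int);

    int tokenize(UnifiedTokenizer *tok, Fts5Tokenizer *inst, void *ctx, int flags,
                 const char *text, int nText, TokenCallback xToken) {
    #ifdef FTS5_V2_AVAILABLE
        if (tok->version == 2) {
            /* v2 takes additional locale arguments; none is passed here */
            return tok->v2->xTokenize(inst, ctx, flags, text, nText, NULL, 0, xToken);
        }
    #endif
        return tok->v1.xTokenize(inst, ctx, flags, text, nText, xToken);
    }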
Advanced Debugging Techniques
Custom Allocator Tracking
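Override SQLite’s memory management to track allocations. The allocator cannot tell which blocks belong to tokenizers, so this sketch counts all live allocations; compare the count before and after closing a connection:
    #include <sqlite3.h>

    static sqlite3_mem_methods g_default;  /* SQLite's original allocator */
    static int live_allocs = 0;            /* not thread-safe; debugging only */

    static void *tracked_malloc(int n) {
        void *p = g_default.xMalloc(n);
        if (p) live_allocs++;
        return p;
    }

    static void tracked_free(void *p) {
        if (p) live_allocs--;
        g_default.xFree(p);
    }

    /* During initialization, before any other SQLite call: */
    int install_tracking(void) {
        sqlite3_mem_methods m;
        int rc = sqlite3_config(SQLITE_CONFIG_GETMALLOC, &g_default);
        if (rc != SQLITE_OK) return rc;
        m = g_default;
        m.xMalloc = tracked_malloc;
        m.xFree = tracked_free;
        return sqlite3_config(SQLITE_CONFIG_MALLOC, &m);
    }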
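Stack Trace Capture on Allocation
Use platform-specific APIs to log allocation origins (backtrace() from execinfo.h is shown here):
    #include <execinfo.h>
    #include <stdio.h>
    #include <sqlite3.h>

    #define MAX_STACK 20

    void *fts5_malloc(int size) {
        void *addrs[MAX_STACK];
        int depth = backtrace(addrs, MAX_STACK);
        /* Log size and stack trace to stderr (fd 2) */
        fprintf(stderr, "fts5 alloc: %d bytes\n", size);
        backtrace_symbols_fd(addrs, depth, 2);
        /* Allocate through SQLite so that sqlite3_free() remains a valid match */
        return sqlite3_malloc(size);
    }

    /* In SQLite FTS5 code, after the definition above (debug builds only): */
    #define sqlite3_malloc fts5_malloc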
Lifetime Visualization Tools
Create connection-tokenizer dependency graphs using DOT notation:
    from graphviz import Digraph

    class ConnectionVisualizer:
        def __init__(self):
            self.graph = Digraph()
            self.conn_count = 0

        def add_connection(self, conn):
            cid = f"conn_{self.conn_count}"
            self.graph.node(cid, label=f"Connection {self.conn_count}")
            self.conn_count += 1
            return cid

        def add_tokenizer(self, cid, tok):
            tid = f"tok_{id(tok)}"
            self.graph.node(tid, label="Tokenizer", shape="box")
            self.graph.edge(cid, tid)
Performance Considerations
Tokenizer Caching Strategies
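Cache the tokenizer lookup per connection. The cached struct belongs to FTS5, so the cache must only drop its pointer and never free it, and the application should reset the cache whenever the cached connection is closed (a later connection may reuse the same address):
    static fts5_tokenizer_v2 *g_cached_tok = NULL;
    static sqlite3 *g_last_conn = NULL;

    int get_cached_tokenizer(sqlite3 *db, fts5_api *api, fts5_tokenizer_v2 **out) {
        if (db != g_last_conn) {
            void *pUserData = NULL;
            /* Previous connection's tokenizer is owned by FTS5: forget it, do not free it */
            g_cached_tok = NULL;
            g_last_conn = NULL;
            if (api->xFindTokenizer_v2(api, "unicode61", &pUserData, out)) {
                return SQLITE_ERROR;
            }
            g_cached_tok = *out;
            g_last_conn = db;
        } else {
            *out = g_cached_tok;
        }
        return SQLITE_OK;
    }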
Connection Pool Integration
Maintain open connections for long-lived tokenizers:
    import apsw
    from queue import Queue

    class TokenizerPool:
        def __init__(self, max_conns=5):
            self.pool = Queue(max_conns)
            for _ in range(max_conns):
                conn = apsw.Connection(":memory:")
                # Pre-warm connection if needed
                self.pool.put(conn)

        def get_tokenizer(self):
            conn = self.pool.get()
            try:
                return conn.fts5_tokenizer("unicode61")
            finally:
                self.pool.put(conn)
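A pool can also keep the connection checked out for as long as its tokenizer is in use, so the two lifetimes stay visibly tied together. The sketch below is one possible variant, not an established pattern from any library: ScopedTokenizerPool, connect, and lookup are hypothetical names, and plain Python objects stand in for real connections so that it runs without SQLite bindings.

```python
from contextlib import contextmanager
from queue import Queue


class ScopedTokenizerPool:
    """Pool that keeps a connection checked out while its tokenizer is in use."""

    def __init__(self, connect, lookup, max_conns=2):
        # connect() and lookup(conn) are caller-supplied callables (assumptions).
        self._lookup = lookup
        self._pool = Queue(max_conns)
        for _ in range(max_conns):
            self._pool.put(connect())

    @contextmanager
    def tokenizer(self):
        conn = self._pool.get()
        try:
            yield self._lookup(conn)
        finally:
            # The connection returns to the pool only after the caller is done.
            self._pool.put(conn)

    def available(self):
        return self._pool.qsize()


# Stand-in objects so the sketch runs without SQLite installed.
pool = ScopedTokenizerPool(connect=object, lookup=lambda conn: str.split)

with pool.tokenizer() as tok:
    in_use = pool.available()
    tokens = tok("alpha beta")
after = pool.available()
print(in_use, after, tokens)
```

With APSW, connect could be a function returning apsw.Connection(":memory:") and lookup a function calling conn.fts5_tokenizer("unicode61"); the pool itself never closes a connection while a tokenizer obtained from it is inside the with block.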
Conclusion and Version-Specific Recommendations
The FTS5 tokenizer API evolution introduces critical behavioral changes that demand careful attention to object lifetimes. Developers must choose between:
- v1 API: Full control over tokenizer lifetime at the cost of manual memory management
- v2 API: Simplified version handling with strict connection coupling
For most applications, adopting v2 with proper connection-scoped tokenizer management offers the best balance of safety and maintainability. Legacy systems requiring long-lived tokenizers should either retain v1 usage or implement robust wrapper layers that mediate access to v2 tokenizers across connection boundaries.