Segfaults When Using FTS5 Tokenizer v2 API After Closing Connection

Understanding FTS5 Tokenizer API Lifetime Changes and Segmentation Faults

FTS5 Tokenizer API Version Differences and Memory Management

The core issue revolves around behavioral changes between the FTS5 tokenizer v1 and v2 APIs in SQLite, specifically regarding object lifetime management and memory ownership. When using the v2 API, attempts to free tokenizer objects after closing their parent database connection result in segmentation faults, while the same pattern works correctly with v1. This discrepancy stems from fundamental architectural differences in how these API versions handle tokenizer object ownership and lifecycle management.

In the v1 API (fts5_tokenizer), developers supply their own struct instance that SQLite populates with function pointers. The caller retains full ownership of this memory, allowing tokenizer instances to persist beyond connection closure without conflict. The v2 API (fts5_tokenizer_v2) reverses this relationship: SQLite provides a pre-allocated struct containing both the tokenizer implementation and internal state. This structure becomes invalid when the parent database connection closes, as FTS5 automatically cleans up associated resources. Attempting to use or free these now-invalid pointers leads to undefined behavior, including crashes.
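
The contrast is visible in the lookup calls themselves. A sketch against the public fts5.h declarations (connection setup assumed):

    /* v1: the caller owns the struct; SQLite merely copies function
       pointers into it, so the copy can outlive the connection */
    fts5_tokenizer tok1;
    void* pUserdata1 = 0;
    pApi->xFindTokenizer(pApi, "unicode61", &pUserdata1, &tok1);

    /* v2: SQLite hands back a pointer to a struct it owns; that pointer
       becomes invalid when the connection closes */
    fts5_tokenizer_v2* pTok2 = 0;
    void* pUserdata2 = 0;
    pApi->xFindTokenizer_v2(pApi, "unicode61", &pUserdata2, &pTok2);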

The critical distinction lies in ownership semantics:

  • v1: Caller-owned tokenizer struct with SQLite-populated function pointers
  • v2: SQLite-owned tokenizer struct containing implementation details

This architectural shift introduces strict lifetime coupling between database connections and tokenizer instances in v2. While v1 allowed tokenizers to outlive their originating connections (provided developers managed memory correctly), v2 intrinsically links tokenizer validity to connection lifetime. The segmentation fault occurs because closing the connection triggers FTS5’s cleanup routines, which deallocate the tokenizer struct. Subsequent attempts to interact with the tokenizer (including explicit deletion) access freed memory.
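
A minimal sketch of the failing pattern (pApi and db obtained in the usual way; for illustration only):

    fts5_tokenizer_v2* pTok = 0;
    void* pUserdata = 0;
    Fts5Tokenizer* pInst = 0;
    pApi->xFindTokenizer_v2(pApi, "unicode61", &pUserdata, &pTok);
    pTok->xCreate(pUserdata, 0, 0, &pInst);  /* instance created while open */
    sqlite3_close(db);                       /* FTS5 reclaims pTok's memory */
    pTok->xDelete(pInst);                    /* use-after-free: likely segfault */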

Root Causes of Tokenizer Invalidation After Connection Closure

Three primary factors contribute to this behavioral divergence:

  1. Versioned Object Handling in v2 API
    The fts5_tokenizer_v2 structure contains version-specific fields that may expand over time. By returning SQLite-owned structs, FTS5 ensures binary compatibility across versions. However, this design transfers memory management responsibility to the database engine, creating implicit lifetime dependencies. When connections close, SQLite immediately reclaims these version-sensitive structures rather than waiting for explicit developer cleanup.

  2. Absence of Reference Counting
    Unlike some SQLite APIs that employ reference counting for shared resources, FTS5 tokenizers maintain a direct owner-relationship with their parent connection. The v2 API provides no mechanism to extend tokenizer lifetime beyond connection closure, even if outstanding references exist. This contrasts with v1’s caller-owned model, where developers could implement custom reference tracking.

  3. Callback Function Ownership
    The v2 API bundles tokenizer implementation details (xCreate, xTokenize, xDelete) within SQLite-managed memory. When the connection closes:

    • Tokenizer structs containing these function pointers get deallocated
    • Any copies of the pointers become dangling references
    • Subsequent calls through these pointers (even if captured earlier) access invalid memory

This explains why simply copying the xDelete/xTokenize function pointers from v2 structs doesn’t prevent crashes – the implementations behind these pointers may rely on connection-specific context that gets destroyed during closure.
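
Building on that sketch, the safe ordering under v2 is to finish all tokenizer work and drop every pointer before the connection closes:

    pTok->xDelete(pInst);  /* destroy instances while the db is still open */
    pInst = 0;
    pTok = 0;              /* forget the SQLite-owned struct */
    sqlite3_close(db);     /* FTS5 frees its registry; nothing dangles */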

Resolving Tokenizer Lifetime Conflicts in FTS5 v2 Implementations

Immediate Mitigation Strategies

  1. Connection Lifetime Alignment
    Restructure code to ensure tokenizer objects never outlive their parent connections. In garbage-collected environments like Python:

    import contextlib
    import apsw

    # APSW's "with conn:" scopes a transaction, not the connection lifetime,
    # so use contextlib.closing to guarantee closure
    with contextlib.closing(apsw.Connection(":memory:")) as conn:
        tokenizer = conn.fts5_tokenizer("unicode61")
        # Use tokenizer within connection scope
    # Connection closed here; the tokenizer must not be used again
    

    Manually nullify tokenizer references before closing connections in non-GC contexts.

  2. API Version Selection
    Continue using the v1 API (fts5_tokenizer) when long-lived tokenizers are necessary. The v1 pattern remains valid and safe when properly managed:

    fts5_tokenizer tok;
    void* pUserdata = 0;
    pApi->xFindTokenizer(pApi, "unicode61", &pUserdata, &tok);
    /* tok is a caller-owned copy of the function pointers and remains
       usable after the connection closes */
    
  3. Function Pointer Isolation
    Extract and persist v2 tokenizer functions independently of their container struct:

    # Illustrative pseudocode: assumes a binding that exposes the v2
    # struct's function pointers as attributes (APSW does not)
    class TokenizerWrapper:
        def __init__(self, v2_struct):
            self.xCreate = v2_struct.xCreate
            self.xTokenize = v2_struct.xTokenize
            self.xDelete = v2_struct.xDelete

    conn = apsw.Connection(":memory:")
    raw_tok = conn.fts5_tokenizer("unicode61")
    wrapped_tok = TokenizerWrapper(raw_tok)
    conn.close()
    # wrapped_tok no longer touches the freed struct, but the captured
    # functions may still depend on connection state; calling
    # wrapped_tok.xDelete() here can still crash
    

    This avoids dereferencing the freed struct itself, but as explained above, the implementations behind the captured pointers may still depend on connection-specific state. Treat any post-closure call through them as unsafe.

Long-Term Architectural Solutions

  1. Connection Closure Hooks
    Implement connection-scoped cleanup that invalidates associated tokenizers. SQLite 3.44+ offers sqlite3_set_clientdata(), whose destructor runs when the connection closes:

    typedef struct {
        fts5_tokenizer_v2* tok;
        sqlite3* db;
    } TokenizerContext;
    
    /* Invoked by SQLite when the connection closes (or the slot is reused) */
    void tokenizer_ctx_destroy(void* data) {
        TokenizerContext* ctx = (TokenizerContext*)data;
        ctx->tok = NULL; /* drop the reference; SQLite frees the struct itself */
        sqlite3_free(ctx);
    }
    
    // When creating tokenizer (requires SQLite 3.44+ for clientdata):
    void* pUserdata = 0;
    TokenizerContext* ctx = sqlite3_malloc(sizeof(*ctx));
    ctx->db = db;
    pApi->xFindTokenizer_v2(pApi, "unicode61", &pUserdata, &ctx->tok);
    sqlite3_set_clientdata(db, "tokenizer-ctx", ctx, tokenizer_ctx_destroy);
    
  2. Custom Reference Counting
    Wrap v2 tokenizers in reference-counted containers that coordinate with connection state:

    import weakref
    import apsw
    
    class SafeTokenizer:
        def __init__(self, conn, name):
            self._conn_ref = weakref.ref(conn)
            self._tok = conn.fts5_tokenizer(name)
            self._active = True
    
        def close(self):
            # Must run while the parent connection is still open; note that
            # a live weakref only proves the object exists, not that the
            # connection is unclosed
            if self._active:
                self._tok = None  # drop the reference rather than call into it
                self._active = False
    
        def __del__(self):
            self.close()
    
    # Usage:
    conn = apsw.Connection(":memory:")
    tok = SafeTokenizer(conn, "unicode61")
    tok.close()  # release before the connection goes away
    conn.close()
    
  3. API Version Detection and Adaptation
    Implement compile-time detection, plus a runtime fts5_api iVersion check, to select the best available API:

    /* The v2 tokenizer API was added in SQLite 3.47.0 */
    #if SQLITE_VERSION_NUMBER >= 3047000
    #define USE_FTS5_V2 1
    #else
    #define USE_FTS5_V2 0
    #endif
    
    void setup_tokenizer(fts5_api* pApi, const char* name) {
        void* pUserdata = 0;
    #if USE_FTS5_V2
        fts5_tokenizer_v2* tok;
        /* The loaded library must also export the v2 entry points */
        if( pApi->iVersion < 3
         || pApi->xFindTokenizer_v2(pApi, name, &pUserdata, &tok) ){
            // Handle error
        }
        // Valid only while this connection is open
    #else
        fts5_tokenizer tok;
        if( pApi->xFindTokenizer(pApi, name, &pUserdata, &tok) ){
            // Handle error
        }
        // Caller-owned copy; caller manages lifetime
    #endif
    }
    

Deep Dive: SQLite Internal Mechanics

Understanding FTS5’s module registration helps explain the lifetime constraints. When a tokenizer is registered via xCreateTokenizer or xCreateTokenizer_v2, FTS5 stores the implementation in a connection-specific registry. The v2 find API returns pointers into this registry, which is torn down during sqlite3_close() / sqlite3_close_v2():

  1. Connection Closure Sequence

    • Invoke sqlite3_close()
    • Call sqlite3LeaveMutexAndCloseZombie()
    • Trigger sqlite3VtabModuleUnref() for each registered module
    • Execute fts5ModuleDestroy() for FTS5 modules
    • Free tokenizer structures via sqlite3_free()
  2. Tokenization Process Flow

    • xFindTokenizer_v2 retrieves module from connection registry
    • Tokenizer struct contains pointers to current connection’s resources
    • Post-closure, these pointers reference deallocated memory
  3. Memory Sanitizer Diagnostics
    Address sanitizers detect invalid accesses through:

    • Use-after-free: Accessing tokenizer struct post-closure
    • Heap-use-after-free: Calling xDelete/xTokenize after parent struct deallocation
    • Invalid pointer dereference: Using stale function pointers
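
One way to observe this closure sequence directly is to register a custom tokenizer with an xDestroy callback and note that it fires from inside sqlite3_close(), not from application code. A sketch (the "demo" name is illustrative; registration details are elided):

    static void log_destroy(void* pUserData) {
        fprintf(stderr, "tokenizer user data destroyed at connection close\n");
    }

    /* ...obtain pApi, fill in a v1 tokenizer struct 'tok', then: */
    pApi->xCreateTokenizer(pApi, "demo", pUserData, &tok, log_destroy);
    sqlite3_close(db);  /* log_destroy runs from inside this call */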

Best Practices for Stable Tokenizer Management

  1. Lifetime Documentation
    Explicitly document tokenizer-connection relationships:

    "FTS5 tokenizers obtained via xFindTokenizer_v2 remain valid only while their parent database connection exists. Applications must ensure all tokenizer references are discarded before connection closure."

  2. Automated Lifetime Tracking
    Integrate tokenizer management with connection objects:

    import apsw
    
    class ManagedConnection(apsw.Connection):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._tokenizers = []
    
        def fts5_tokenizer(self, name, *args, **kwargs):
            tok = super().fts5_tokenizer(name, *args, **kwargs)
            self._tokenizers.append(tok)
            return tok
    
        def close(self, force=False):
            # Drop tokenizer references before closing so nothing
            # outlives the connection
            self._tokenizers.clear()
            super().close(force)
    
  3. Cross-Version Compatibility Layers
    Create abstraction layers that normalize API differences:

    typedef struct {
        int version;
        union {
            fts5_tokenizer v1;      /* caller-owned copy */
            fts5_tokenizer_v2* v2;  /* SQLite-owned; dies with the connection */
        };
    } UnifiedTokenizer;
    
    int get_tokenizer(fts5_api* api, const char* name, void** ppUserdata,
                      UnifiedTokenizer* out) {
    #if SQLITE_VERSION_NUMBER >= 3047000
        if( api->iVersion >= 3 ){
            out->version = 2;
            return api->xFindTokenizer_v2(api, name, ppUserdata, &out->v2);
        }
    #endif
        out->version = 1;
        return api->xFindTokenizer(api, name, ppUserdata, &out->v1);
    }
    
    /* pInst comes from the matching xCreate; v2's xTokenize takes an
       extra locale parameter pair */
    int tokenize(UnifiedTokenizer* tok, Fts5Tokenizer* pInst, void* pCtx,
                 int flags, const char* pText, int nText,
                 int (*xToken)(void*, int, const char*, int, int, int)) {
        if( tok->version == 2 ){
            return tok->v2->xTokenize(pInst, pCtx, flags, pText, nText,
                                      NULL, 0, xToken);
        }
        return tok->v1.xTokenize(pInst, pCtx, flags, pText, nText, xToken);
    }
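
    A fuller cycle with this layer, run entirely while the connection is open (sketch; print_token and the input string are illustrative):

    static int print_token(void* pCtx, int tflags, const char* pToken,
                           int nToken, int iStart, int iEnd) {
        printf("%.*s\n", nToken, pToken);
        return SQLITE_OK;
    }

    UnifiedTokenizer ut;
    void* pUserdata = 0;
    Fts5Tokenizer* pInst = 0;
    if( get_tokenizer(pApi, "unicode61", &pUserdata, &ut) == SQLITE_OK
     && ut.version == 2
     && ut.v2->xCreate(pUserdata, 0, 0, &pInst) == SQLITE_OK ){
        tokenize(&ut, pInst, 0, FTS5_TOKENIZE_DOCUMENT, "hello world", 11,
                 print_token);
        ut.v2->xDelete(pInst);  /* destroy before the connection closes */
    }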
    

Advanced Debugging Techniques

  1. Custom Allocator Tracking
    Override SQLite’s memory methods to watch allocation counts around tokenizer operations (SQLite does not tag allocations, so attribution is by timing, not by inspecting contents):

    /* Wrap the default allocator; configure before sqlite3_initialize() */
    static sqlite3_mem_methods default_mem;
    static int alloc_count = 0;
    
    void* tracked_malloc(int n) {
        alloc_count++;
        return default_mem.xMalloc(n);
    }
    
    void tracked_free(void* p) {
        if( p ) alloc_count--;
        default_mem.xFree(p);
    }
    
    // During initialization:
    int install_tracking(void) {
        sqlite3_mem_methods m;
        if( sqlite3_config(SQLITE_CONFIG_GETMALLOC, &default_mem) ) return 1;
        m = default_mem;
        m.xMalloc = tracked_malloc;
        m.xFree = tracked_free;
        /* SQLITE_CONFIG_MALLOC takes one sqlite3_mem_methods*, not a
           pair of bare function pointers */
        return sqlite3_config(SQLITE_CONFIG_MALLOC, &m);
    }
    
  2. StackTrace Capture on Allocation
    Use platform-specific APIs to log allocation origins:

    #include <execinfo.h>
    #include <stdlib.h>
    
    #define MAX_STACK 20
    void* fts5_malloc(size_t size) {
        void* addrs[MAX_STACK];
        int n = backtrace(addrs, MAX_STACK);
        backtrace_symbols_fd(addrs, n, 2);  /* log the stack to stderr */
        return malloc(size);
    }
    
    // Build hack: recompile the FTS5 translation units with allocations
    // redirected through the logger
    #define sqlite3_malloc fts5_malloc
    
  3. Lifetime Visualization Tools
    Create connection-tokenizer dependency graphs using DOT notation:

    from graphviz import Digraph
    
    class ConnectionVisualizer:
        def __init__(self):
            self.graph = Digraph()
            self.conn_count = 0
        
        def add_connection(self, conn):
            cid = f"conn_{self.conn_count}"
            self.graph.node(cid, label=f"Connection {self.conn_count}")
            self.conn_count += 1
            return cid
    
        def add_tokenizer(self, cid, tok):
            tid = f"tok_{id(tok)}"
            self.graph.node(tid, label="Tokenizer", shape="box")
            self.graph.edge(cid, tid)
    

Performance Considerations

  1. Tokenizer Caching Strategies
    Avoid repeated lookups by caching the tokenizer pointer for the current connection. A v2 struct cannot be reused across connections, and must never be freed by the caller:

    static fts5_tokenizer_v2* g_cached_tok = NULL;
    static sqlite3* g_last_conn = NULL;
    
    int get_cached_tokenizer(sqlite3* db, fts5_api* api, fts5_tokenizer_v2** out) {
        if( db != g_last_conn ){
            void* pUserdata = 0;
            /* Do NOT sqlite3_free() the old pointer: SQLite owns it and
               reclaims it when its connection closes */
            g_cached_tok = NULL;
            if( api->xFindTokenizer_v2(api, "unicode61", &pUserdata, out) ){
                return SQLITE_ERROR;
            }
            g_cached_tok = *out;
            g_last_conn = db;
        } else {
            *out = g_cached_tok;
        }
        return SQLITE_OK;
    }
    
  2. Connection Pool Integration
    Maintain open connections for long-lived tokenizers:

    from queue import Queue
    
    import apsw
    
    class TokenizerPool:
        def __init__(self, max_conns=5):
            self.pool = Queue(max_conns)
            for _ in range(max_conns):
                conn = apsw.Connection(":memory:")
                # Pre-warm connection if needed
                self.pool.put(conn)
    
        def get_tokenizer(self):
            # Pool connections stay open for the program's lifetime, so the
            # returned tokenizer remains valid
            conn = self.pool.get()
            try:
                return conn.fts5_tokenizer("unicode61")
            finally:
                self.pool.put(conn)
    

Conclusion and Version-Specific Recommendations

The FTS5 tokenizer API evolution introduces critical behavioral changes that demand careful attention to object lifetimes. Developers must choose between:

  • v1 API: Full control over tokenizer lifetime at the cost of manual memory management
  • v2 API: Simplified version handling with strict connection coupling

For most applications, adopting v2 with proper connection-scoped tokenizer management offers the best balance of safety and maintainability. Legacy systems requiring long-lived tokenizers should either retain v1 usage or implement robust wrapper layers that mediate access to v2 tokenizers across connection boundaries.
