Segfaults When Using FTS5 Tokenizer v2 API After Closing Connection

Understanding FTS5 Tokenizer API Lifetime Changes and Segmentation Faults

FTS5 Tokenizer API Version Differences and Memory Management

The core issue revolves around behavioral changes between the FTS5 tokenizer v1 and v2 APIs in SQLite, specifically regarding object lifetime management and memory ownership. When using the v2 API, attempts to free tokenizer objects after closing their parent database connection result in segmentation faults, while the same pattern works correctly with v1. This discrepancy stems from fundamental architectural differences in how these API versions handle tokenizer object ownership and lifecycle management.

In the v1 API (fts5_tokenizer), developers supply their own struct instance that SQLite populates with function pointers. The caller retains full ownership of this memory, allowing tokenizer instances to persist beyond connection closure without conflict. The v2 API (fts5_tokenizer_v2) reverses this relationship: SQLite provides a pre-allocated struct containing both the tokenizer implementation and internal state. This structure becomes invalid when the parent database connection closes, as FTS5 automatically cleans up associated resources. Attempting to use or free these now-invalid pointers leads to undefined behavior, including crashes.
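
The contrast is visible in the lookup calls themselves. A sketch against the public fts5.h declarations (connection setup assumed):

    /* v1: the caller owns the struct; SQLite merely copies function
       pointers into it, so the copy can outlive the connection */
    fts5_tokenizer tok1;
    void* pUserdata1 = 0;
    pApi->xFindTokenizer(pApi, "unicode61", &pUserdata1, &tok1);

    /* v2: SQLite hands back a pointer to a struct it owns; that pointer
       becomes invalid when the connection closes */
    fts5_tokenizer_v2* pTok2 = 0;
    void* pUserdata2 = 0;
    pApi->xFindTokenizer_v2(pApi, "unicode61", &pUserdata2, &pTok2);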

The critical distinction lies in ownership semantics:

  • v1: Caller-owned tokenizer struct with SQLite-populated function pointers
  • v2: SQLite-owned tokenizer struct containing implementation details

This architectural shift introduces strict lifetime coupling between database connections and tokenizer instances in v2. While v1 allowed tokenizers to outlive their originating connections (provided developers managed memory correctly), v2 intrinsically links tokenizer validity to connection lifetime. The segmentation fault occurs because closing the connection triggers FTS5’s cleanup routines, which deallocate the tokenizer struct. Subsequent attempts to interact with the tokenizer (including explicit deletion) access freed memory.
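
A minimal sketch of the failing pattern (pApi and db obtained in the usual way; for illustration only):

    fts5_tokenizer_v2* pTok = 0;
    void* pUserdata = 0;
    Fts5Tokenizer* pInst = 0;
    pApi->xFindTokenizer_v2(pApi, "unicode61", &pUserdata, &pTok);
    pTok->xCreate(pUserdata, 0, 0, &pInst);  /* instance created while open */
    sqlite3_close(db);                       /* FTS5 reclaims pTok's memory */
    pTok->xDelete(pInst);                    /* use-after-free: likely segfault */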

Root Causes of Tokenizer Invalidation After Connection Closure

Three primary factors contribute to this behavioral divergence:

  1. Versioned Object Handling in v2 API
    The fts5_tokenizer_v2 structure contains version-specific fields that may expand over time. By returning SQLite-owned structs, FTS5 ensures binary compatibility across versions. However, this design transfers memory management responsibility to the database engine, creating implicit lifetime dependencies. When connections close, SQLite immediately reclaims these version-sensitive structures rather than waiting for explicit developer cleanup.

  2. Absence of Reference Counting
    Unlike some SQLite APIs that employ reference counting for shared resources, FTS5 tokenizers maintain a direct owner-relationship with their parent connection. The v2 API provides no mechanism to extend tokenizer lifetime beyond connection closure, even if outstanding references exist. This contrasts with v1’s caller-owned model, where developers could implement custom reference tracking.

  3. Callback Function Ownership
    The v2 API bundles tokenizer implementation details (xCreate, xTokenize, xDelete) within SQLite-managed memory. When the connection closes:

    • Tokenizer structs containing these function pointers get deallocated
    • Any copies of the pointers become dangling references
    • Subsequent calls through these pointers (even if captured earlier) access invalid memory

This explains why simply copying the xDelete/xTokenize function pointers from v2 structs doesn’t prevent crashes – the implementations behind these pointers may rely on connection-specific context that gets destroyed during closure.
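
Building on that sketch, the safe ordering under v2 is to finish all tokenizer work and drop every pointer before the connection closes:

    pTok->xDelete(pInst);  /* destroy instances while the db is still open */
    pInst = 0;
    pTok = 0;              /* forget the SQLite-owned struct */
    sqlite3_close(db);     /* FTS5 frees its registry; nothing dangles */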

Resolving Tokenizer Lifetime Conflicts in FTS5 v2 Implementations

Immediate Mitigation Strategies

  1. Connection Lifetime Alignment
    Restructure code to ensure tokenizer objects never outlive their parent connections. In garbage-collected environments like Python:

    import contextlib
    import apsw

    # APSW's "with conn:" scopes a transaction, not the connection lifetime,
    # so use contextlib.closing to guarantee closure
    with contextlib.closing(apsw.Connection(":memory:")) as conn:
        tokenizer = conn.fts5_tokenizer("unicode61")
        # Use tokenizer within connection scope
    # Connection closed here; the tokenizer must not be used again
    

    Manually nullify tokenizer references before closing connections in non-GC contexts.

  2. API Version Selection
    Continue using the v1 API (fts5_tokenizer) when long-lived tokenizers are necessary. The v1 pattern remains valid and safe when properly managed:

    fts5_tokenizer tok;
    void* pUserdata = 0;
    pApi->xFindTokenizer(pApi, "unicode61", &pUserdata, &tok);
    /* tok is a caller-owned copy of the function pointers and remains
       usable after the connection closes */
    
  3. Function Pointer Isolation
    Extract and persist v2 tokenizer functions independently of their container struct:

    # Illustrative pseudocode: assumes a binding that exposes the v2
    # struct's function pointers as attributes (APSW does not)
    class TokenizerWrapper:
        def __init__(self, v2_struct):
            self.xCreate = v2_struct.xCreate
            self.xTokenize = v2_struct.xTokenize
            self.xDelete = v2_struct.xDelete

    conn = apsw.Connection(":memory:")
    raw_tok = conn.fts5_tokenizer("unicode61")
    wrapped_tok = TokenizerWrapper(raw_tok)
    conn.close()
    # wrapped_tok no longer touches the freed struct, but the captured
    # functions may still depend on connection state; calling
    # wrapped_tok.xDelete() here can still crash
    

    This avoids dereferencing the freed struct itself, but as explained above, the implementations behind the captured pointers may still depend on connection-specific state. Treat any post-closure call through them as unsafe.

Long-Term Architectural Solutions

  1. Connection Closure Hooks
    Implement connection-scoped cleanup that invalidates associated tokenizers. SQLite 3.44+ offers sqlite3_set_clientdata(), whose destructor runs when the connection closes:

    typedef struct {
        fts5_tokenizer_v2* tok;
        sqlite3* db;
    } TokenizerContext;
    
    /* Invoked by SQLite when the connection closes (or the slot is reused) */
    void tokenizer_ctx_destroy(void* data) {
        TokenizerContext* ctx = (TokenizerContext*)data;
        ctx->tok = NULL; /* drop the reference; SQLite frees the struct itself */
        sqlite3_free(ctx);
    }
    
    // When creating tokenizer (requires SQLite 3.44+ for clientdata):
    void* pUserdata = 0;
    TokenizerContext* ctx = sqlite3_malloc(sizeof(*ctx));
    ctx->db = db;
    pApi->xFindTokenizer_v2(pApi, "unicode61", &pUserdata, &ctx->tok);
    sqlite3_set_clientdata(db, "tokenizer-ctx", ctx, tokenizer_ctx_destroy);
    
  2. Custom Reference Counting
    Wrap v2 tokenizers in reference-counted containers that coordinate with connection state:

    import weakref
    import apsw
    
    class SafeTokenizer:
        def __init__(self, conn, name):
            self._conn_ref = weakref.ref(conn)
            self._tok = conn.fts5_tokenizer(name)
            self._active = True
    
        def close(self):
            # Must run while the parent connection is still open; note that
            # a live weakref only proves the object exists, not that the
            # connection is unclosed
            if self._active:
                self._tok = None  # drop the reference rather than call into it
                self._active = False
    
        def __del__(self):
            self.close()
    
    # Usage:
    conn = apsw.Connection(":memory:")
    tok = SafeTokenizer(conn, "unicode61")
    tok.close()  # release before the connection goes away
    conn.close()
    
  3. API Version Detection and Adaptation
    Implement compile-time detection, plus a runtime fts5_api iVersion check, to select the best available API:

    /* The v2 tokenizer API was added in SQLite 3.47.0 */
    #if SQLITE_VERSION_NUMBER >= 3047000
    #define USE_FTS5_V2 1
    #else
    #define USE_FTS5_V2 0
    #endif
    
    void setup_tokenizer(fts5_api* pApi, const char* name) {
        void* pUserdata = 0;
    #if USE_FTS5_V2
        fts5_tokenizer_v2* tok;
        /* The loaded library must also export the v2 entry points */
        if( pApi->iVersion < 3
         || pApi->xFindTokenizer_v2(pApi, name, &pUserdata, &tok) ){
            // Handle error
        }
        // Valid only while this connection is open
    #else
        fts5_tokenizer tok;
        if( pApi->xFindTokenizer(pApi, name, &pUserdata, &tok) ){
            // Handle error
        }
        // Caller-owned copy; caller manages lifetime
    #endif
    }
    

Deep Dive: SQLite Internal Mechanics

Understanding FTS5’s module registration helps explain the lifetime constraints. When a tokenizer is registered via xCreateTokenizer or xCreateTokenizer_v2, FTS5 stores the implementation in a connection-specific registry. The v2 find API returns pointers into this registry, which is torn down during sqlite3_close() / sqlite3_close_v2():

  1. Connection Closure Sequence

    • Invoke sqlite3_close()
    • Call sqlite3LeaveMutexAndCloseZombie()
    • Trigger sqlite3VtabModuleUnref() for each registered module
    • Execute fts5ModuleDestroy() for FTS5 modules
    • Free tokenizer structures via sqlite3_free()
  2. Tokenization Process Flow

    • xFindTokenizer_v2 retrieves module from connection registry
    • Tokenizer struct contains pointers to current connection’s resources
    • Post-closure, these pointers reference deallocated memory
  3. Memory Sanitizer Diagnostics
    Address sanitizers detect invalid accesses through:

    • Use-after-free: Accessing tokenizer struct post-closure
    • Heap-use-after-free: Calling xDelete/xTokenize after parent struct deallocation
    • Invalid pointer dereference: Using stale function pointers
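
One way to observe this closure sequence directly is to register a custom tokenizer with an xDestroy callback and note that it fires from inside sqlite3_close(), not from application code. A sketch (the "demo" name is illustrative; registration details are elided):

    static void log_destroy(void* pUserData) {
        fprintf(stderr, "tokenizer user data destroyed at connection close\n");
    }

    /* ...obtain pApi, fill in a v1 tokenizer struct 'tok', then: */
    pApi->xCreateTokenizer(pApi, "demo", pUserData, &tok, log_destroy);
    sqlite3_close(db);  /* log_destroy runs from inside this call */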

Best Practices for Stable Tokenizer Management

  1. Lifetime Documentation
    Explicitly document tokenizer-connection relationships:

    "FTS5 tokenizers obtained via xFindTokenizer_v2 remain valid only while their parent database connection exists. Applications must ensure all tokenizer references are discarded before connection closure."

  2. Automated Lifetime Tracking
    Integrate tokenizer management with connection objects:

    import apsw
    
    class ManagedConnection(apsw.Connection):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._tokenizers = []
    
        def fts5_tokenizer(self, name, *args, **kwargs):
            tok = super().fts5_tokenizer(name, *args, **kwargs)
            self._tokenizers.append(tok)
            return tok
    
        def close(self, force=False):
            # Drop tokenizer references before closing so nothing
            # outlives the connection
            self._tokenizers.clear()
            super().close(force)
    
  3. Cross-Version Compatibility Layers
    Create abstraction layers that normalize API differences:

    typedef struct {
        int version;
        union {
            fts5_tokenizer v1;      /* caller-owned copy */
            fts5_tokenizer_v2* v2;  /* SQLite-owned; dies with the connection */
        };
    } UnifiedTokenizer;
    
    int get_tokenizer(fts5_api* api, const char* name, void** ppUserdata,
                      UnifiedTokenizer* out) {
    #if SQLITE_VERSION_NUMBER >= 3047000
        if( api->iVersion >= 3 ){
            out->version = 2;
            return api->xFindTokenizer_v2(api, name, ppUserdata, &out->v2);
        }
    #endif
        out->version = 1;
        return api->xFindTokenizer(api, name, ppUserdata, &out->v1);
    }
    
    /* pInst comes from the matching xCreate; v2's xTokenize takes an
       extra locale parameter pair */
    int tokenize(UnifiedTokenizer* tok, Fts5Tokenizer* pInst, void* pCtx,
                 int flags, const char* pText, int nText,
                 int (*xToken)(void*, int, const char*, int, int, int)) {
        if( tok->version == 2 ){
            return tok->v2->xTokenize(pInst, pCtx, flags, pText, nText,
                                      NULL, 0, xToken);
        }
        return tok->v1.xTokenize(pInst, pCtx, flags, pText, nText, xToken);
    }
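
    A fuller cycle with this layer, run entirely while the connection is open (sketch; print_token and the input string are illustrative):

    static int print_token(void* pCtx, int tflags, const char* pToken,
                           int nToken, int iStart, int iEnd) {
        printf("%.*s\n", nToken, pToken);
        return SQLITE_OK;
    }

    UnifiedTokenizer ut;
    void* pUserdata = 0;
    Fts5Tokenizer* pInst = 0;
    if( get_tokenizer(pApi, "unicode61", &pUserdata, &ut) == SQLITE_OK
     && ut.version == 2
     && ut.v2->xCreate(pUserdata, 0, 0, &pInst) == SQLITE_OK ){
        tokenize(&ut, pInst, 0, FTS5_TOKENIZE_DOCUMENT, "hello world", 11,
                 print_token);
        ut.v2->xDelete(pInst);  /* destroy before the connection closes */
    }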
    

Advanced Debugging Techniques

  1. Custom Allocator Tracking
    Override SQLite’s memory methods to watch allocation counts around tokenizer operations (SQLite does not tag allocations, so attribution is by timing, not by inspecting contents):

    /* Wrap the default allocator; configure before sqlite3_initialize() */
    static sqlite3_mem_methods default_mem;
    static int alloc_count = 0;
    
    void* tracked_malloc(int n) {
        alloc_count++;
        return default_mem.xMalloc(n);
    }
    
    void tracked_free(void* p) {
        if( p ) alloc_count--;
        default_mem.xFree(p);
    }
    
    // During initialization:
    int install_tracking(void) {
        sqlite3_mem_methods m;
        if( sqlite3_config(SQLITE_CONFIG_GETMALLOC, &default_mem) ) return 1;
        m = default_mem;
        m.xMalloc = tracked_malloc;
        m.xFree = tracked_free;
        /* SQLITE_CONFIG_MALLOC takes one sqlite3_mem_methods*, not a
           pair of bare function pointers */
        return sqlite3_config(SQLITE_CONFIG_MALLOC, &m);
    }
    
  2. StackTrace Capture on Allocation
    Use platform-specific APIs to log allocation origins:

    #include <execinfo.h>
    #include <stdlib.h>
    
    #define MAX_STACK 20
    void* fts5_malloc(size_t size) {
        void* addrs[MAX_STACK];
        int n = backtrace(addrs, MAX_STACK);
        backtrace_symbols_fd(addrs, n, 2);  /* log the stack to stderr */
        return malloc(size);
    }
    
    // Build hack: recompile the FTS5 translation units with allocations
    // redirected through the logger
    #define sqlite3_malloc fts5_malloc
    
  3. Lifetime Visualization Tools
    Create connection-tokenizer dependency graphs using DOT notation:

    from graphviz import Digraph
    
    class ConnectionVisualizer:
        def __init__(self):
            self.graph = Digraph()
            self.conn_count = 0
        
        def add_connection(self, conn):
            cid = f"conn_{self.conn_count}"
            self.graph.node(cid, label=f"Connection {self.conn_count}")
            self.conn_count += 1
            return cid
    
        def add_tokenizer(self, cid, tok):
            tid = f"tok_{id(tok)}"
            self.graph.node(tid, label="Tokenizer", shape="box")
            self.graph.edge(cid, tid)
    

Performance Considerations

  1. Tokenizer Caching Strategies
    Avoid repeated lookups by caching the tokenizer pointer for the current connection. A v2 struct cannot be reused across connections, and must never be freed by the caller:

    static fts5_tokenizer_v2* g_cached_tok = NULL;
    static sqlite3* g_last_conn = NULL;
    
    int get_cached_tokenizer(sqlite3* db, fts5_api* api, fts5_tokenizer_v2** out) {
        if( db != g_last_conn ){
            void* pUserdata = 0;
            /* Do NOT sqlite3_free() the old pointer: SQLite owns it and
               reclaims it when its connection closes */
            g_cached_tok = NULL;
            if( api->xFindTokenizer_v2(api, "unicode61", &pUserdata, out) ){
                return SQLITE_ERROR;
            }
            g_cached_tok = *out;
            g_last_conn = db;
        } else {
            *out = g_cached_tok;
        }
        return SQLITE_OK;
    }
    
  2. Connection Pool Integration
    Maintain open connections for long-lived tokenizers:

    from queue import Queue
    
    import apsw
    
    class TokenizerPool:
        def __init__(self, max_conns=5):
            self.pool = Queue(max_conns)
            for _ in range(max_conns):
                conn = apsw.Connection(":memory:")
                # Pre-warm connection if needed
                self.pool.put(conn)
    
        def get_tokenizer(self):
            # Pool connections stay open for the program's lifetime, so the
            # returned tokenizer remains valid
            conn = self.pool.get()
            try:
                return conn.fts5_tokenizer("unicode61")
            finally:
                self.pool.put(conn)
    

Conclusion and Version-Specific Recommendations

The FTS5 tokenizer API evolution introduces critical behavioral changes that demand careful attention to object lifetimes. Developers must choose between:

  • v1 API: Full control over tokenizer lifetime at the cost of manual memory management
  • v2 API: Simplified version handling with strict connection coupling

For most applications, adopting v2 with proper connection-scoped tokenizer management offers the best balance of safety and maintainability. Legacy systems requiring long-lived tokenizers should either retain v1 usage or implement robust wrapper layers that mediate access to v2 tokenizers across connection boundaries.
