Character Encoding Mismatch in LIKE Clause Optimization for 256-ASCII Virtual Tables

UTF-8 to 256-ASCII Transcoding Challenges in LIKE Range Optimization

The core issue is the interaction between SQLite’s UTF-8 text handling and a 256-character ASCII encoding that a legacy application exposes through virtual tables. The problem appears when LIKE queries use characters at the upper boundary of the 256-character set (character code 255 in 0-based indexing, or 256 in 1-based systems). For indexed columns, SQLite’s query optimizer converts a LIKE 'char%' pattern into a pair of range comparisons (col >= 'char' AND col < 'chas'), but this optimization breaks down at the maximum character value because of:

Overflow errors when attempting to calculate char+1 beyond the 256-character limit
Lossy conversions between UTF-8 and the proprietary encoding
Mismatched collation sequences between the virtual table’s native format and SQLite’s UTF-8 handling

The virtual table implementation uses conversion functions (fromUTF8()/toUTF8()) that cannot properly handle edge cases at character code 256, leading to invalid range calculations. This results in either:

Empty result sets for valid queries
Index scans becoming full table scans
Data corruption during round-trip conversions

Boundary Overflow in Character Range Calculations

1. Encoding System Limitations

The 256-character ASCII implementation likely uses 8-bit storage (0-255 range) while presenting itself as 1-based (1-256) to users. SQLite’s UTF-8 handling assumes valid Unicode code points, creating three critical mismatches:

Character 256 Interpretation:
- As 0x100 in hexadecimal (beyond single-byte storage)
- Requires multi-byte UTF-8 encoding (0xC4 0x80 for U+0100; the native maximum 0xFF corresponds to U+00FF, encoded as 0xC3 0xBF)
- Loses fidelity when converted back to 8-bit storage

Collation Sequence Differences:
- Native sorting vs. UTF-8 binary comparison
- Case folding expectations
- Special character treatment (accented letters, control codes)

Range Calculation Overflow:

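/* Simplified sketch of the range calculation (not SQLite's actual source) */
const char *zPattern = "char%";
size_t cb = strlen(zPattern) - 1;  /* Prefix length, excluding the wildcard */
unsigned char upper[8];            /* Room for the prefix plus a terminator */
memcpy(upper, zPattern, cb);
upper[cb] = 0;
upper[cb-1]++;  /* "char" -> "chas"; a trailing 0xFF would wrap to 0x00 */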

This increment operation creates invalid upper bounds when applied to 0xFF (255 in 0-based) in 8-bit systems.
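
The wrap-around can be reproduced in isolation. The sketch below is not SQLite's code: naive_upper_bound and range_is_usable are hypothetical helpers that increment the last prefix byte in 8-bit storage, as the pseudocode above does, and then check whether the resulting range can match anything.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy an n-byte prefix and increment its last byte,
 * mimicking the naive "prefix + 1" upper bound in 8-bit storage. */
static void naive_upper_bound(const unsigned char *prefix, size_t n,
                              unsigned char *upper)
{
    memcpy(upper, prefix, n);
    if (n > 0) {
        upper[n - 1] = (unsigned char)(upper[n - 1] + 1); /* 0xFF wraps to 0x00 */
    }
}

/* Returns 1 when the half-open range [prefix, upper) can contain rows,
 * i.e. when the computed upper bound sorts above the lower bound. */
static int range_is_usable(const unsigned char *prefix, size_t n)
{
    unsigned char upper[64];
    if (n == 0 || n > sizeof upper) return 0;
    naive_upper_bound(prefix, n, upper);
    return memcmp(prefix, upper, n) < 0;
}
```

For the prefix "abc" the bound becomes "abd" and the range is usable. For a prefix ending in 0xFF the bound wraps below the prefix, so the range is empty and valid rows are silently excluded.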

2. Virtual Table Interface Constraints

The virtual table implementation faces specific challenges through SQLite’s xBestIndex method:

Index constraint absorption:
- Must translate LIKE constraints to native range queries
- Limited to the SQLITE_INDEX_CONSTRAINT_* operator codes that SQLite passes in (a LIKE term arrives as SQLITE_INDEX_CONSTRAINT_LIKE)
- Cannot handle custom range adjustments for edge cases

Conversion function limitations:

/* Example problematic conversion */
char256_to_utf8(0xFF) => 0xC3 0xBF (UTF-8 for U+00FF)
utf8_to_char256(0xC3 0xBF) => 0xFF (valid)

/* Edge case failure */
utf8_to_char256(0xC2 0x80) => 0x80 (U+0080)
char256_to_utf8(0x80) => 0xC2 0x80
utf8_to_char256(0xC2 0x80 +1) => undefined

Conversion functions must handle all 256 characters bidirectionally without data loss.
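
As a minimal sketch of a lossless pair, assume the native set is ISO-8859-1, so that each byte value equals its Unicode code point. The names native_to_utf8 and utf8_to_native are illustrative and are not the application's actual char256_to_utf8/utf8_to_char256 functions.

```c
#include <stddef.h>

/* Encode one native byte as UTF-8, assuming the native set is ISO-8859-1
 * (byte value == Unicode code point). Returns the number of bytes written. */
static size_t native_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) { out[0] = c; return 1; }
    out[0] = (unsigned char)(0xC0 | (c >> 6));
    out[1] = (unsigned char)(0x80 | (c & 0x3F));
    return 2;
}

/* Decode one UTF-8 sequence into a native byte. Returns bytes consumed,
 * or 0 if the input is malformed or the code point is above U+00FF. */
static size_t utf8_to_native(const unsigned char *in, size_t n,
                             unsigned char *out)
{
    if (n >= 1 && in[0] < 0x80) { *out = in[0]; return 1; }
    if (n >= 2 && (in[0] == 0xC2 || in[0] == 0xC3) && (in[1] & 0xC0) == 0x80) {
        *out = (unsigned char)(((in[0] & 0x03) << 6) | (in[1] & 0x3F));
        return 2;
    }
    return 0;
}
```

Returning 0 for anything above U+00FF makes the out-of-range case explicit, so the caller can reject an incremented bound instead of storing a truncated byte.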

3. SQLite Optimization Assumptions

The SQLite query optimizer makes three critical assumptions that conflict with 256-ASCII systems:

Monotonic UTF-8 Sequences:
- Assumes BETWEEN ranges correspond directly to lexical order
- Fails when custom encodings alter sort order

Exclusive Upper Bound:

/* Intended conversion */
WHERE textcol LIKE 'abc%'
=> WHERE textcol >= 'abc' AND textcol < 'abd'

Works whenever the last prefix byte can be incremented, but overflows at 0xFF => 0x100 in 8-bit systems

Collation Consistency:
- Expects LIKE comparison to use same rules as BETWEEN
- Breaks when virtual table uses different collation

Virtual Table Implementation Fixes and Encoding-Safe Optimizations

Step 1: Character Conversion Audit

Validate round-trip conversions for all 256 characters:

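/* Test harness for conversion functions.
** Assumed signatures: toUTF8(in, nIn, out) and fromUTF8(in, nIn, out)
** each return the number of bytes written to out. */
for(int i=0; i<256; i++){
  unsigned char native = (unsigned char)i;
  unsigned char utf8[4];
  unsigned char roundtrip[4];
  int nUtf8 = toUTF8(&native, 1, utf8);
  /* Use the returned length: strlen() would report 0 for byte 0x00 */
  int len = fromUTF8(utf8, nUtf8, roundtrip);
  assert(len == 1 && roundtrip[0] == native);
}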

Fix conversion edge cases:

Map character 255 (0xFF) to UTF-8 0xC3 0xBF (U+00FF)
Ensure 0x00-0x7F map directly to UTF-8 single-byte
Handle 0x80-0xFF as either:
- Windows-1252 superset (common in legacy systems)
- ISO-8859-1 Latin-1 supplement
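
The two interpretations differ only in the range 0x80-0x9F. A table-driven sketch of that difference follows. The function name and the is_cp1252 flag are illustrative, and mapping the five bytes that Windows-1252 leaves undefined to same-valued control codes is an assumption that should be checked against the legacy application's real behavior.

```c
/* Unicode code points for Windows-1252 bytes 0x80..0x9F. The five bytes that
 * Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to
 * the same-valued C1 control here; that choice is an assumption. */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/* Map a native byte to a code point under either interpretation. */
static unsigned native_to_codepoint(unsigned char c, int is_cp1252)
{
    if (is_cp1252 && c >= 0x80 && c <= 0x9F) return cp1252_high[c - 0x80];
    return c; /* ISO-8859-1: byte value equals the code point */
}
```

Under both interpretations 0xFF maps to U+00FF, so the boundary character itself converts the same way. Only the 0x80-0x9F block needs the lookup table.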

Step 2: Modify LIKE Range Calculation

Adjust the BETWEEN optimization to prevent overflow:

Original SQLite Logic:

LIKE 'char%' → col >= 'char' AND col < 'chas'

Modified Logic for 256-ASCII:

CASE 
  WHEN last byte of prefix < 0xFF THEN 
    col >= prefix AND col < prefix + 1
  ELSE
    col >= prefix  -- Open upper bound
END
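
The bound calculation can be sketched in C over native byte strings. This is a slight variation on the logic above: instead of dropping the upper bound whenever the last byte is 0xFF, it first strips trailing 0xFF bytes and increments the byte before them, falling back to an open upper bound only when nothing is left to increment. The name safe_upper_bound is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Compute the exclusive upper bound for rows starting with `prefix` in
 * byte-wise native order. Trailing 0xFF bytes are dropped and the last
 * remaining byte is incremented. Returns the bound's length, or 0 when no
 * finite bound exists (empty prefix or all bytes 0xFF), in which case the
 * caller should scan with the lower bound only. */
static size_t safe_upper_bound(const unsigned char *prefix, size_t n,
                               unsigned char *upper)
{
    while (n > 0 && prefix[n - 1] == 0xFF) n--;   /* cannot increment 0xFF */
    if (n == 0) return 0;
    memcpy(upper, prefix, n);
    upper[n - 1]++;                               /* safe: byte is < 0xFF */
    return n;
}
```

For the prefix "ab" followed by 0xFF this yields the bound "ac", which still excludes every row that does not start with the prefix. A prefix consisting only of 0xFF bytes returns 0, signalling the open-ended case.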

Implementation requires virtual table xBestIndex modification:

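/* In xBestIndex method */
for(int i=0; i<pIdxInfo->nConstraint; i++){
  const struct sqlite3_index_constraint *pConstraint = &pIdxInfo->aConstraint[i];
  if( !pConstraint->usable ) continue;   /* usable is read-only input */
  if( pConstraint->op == SQLITE_INDEX_CONSTRAINT_LIKE ){
    /* Receive the pattern as argv[0] in xFilter, which can then detect
    ** the 0xFF edge case and scan with a >= bound only */
    pIdxInfo->aConstraintUsage[i].argvIndex = 1;
    /* Keep omit at 0 so SQLite still re-checks LIKE on every row */
    pIdxInfo->aConstraintUsage[i].omit = 0;
  }
}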

Step 3: Custom Collation Sequence

sqlite3_create_collation(db, "NATIVE_256", SQLITE_UTF8, 
  NULL, native_256_collation);

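/* In virtual table xBestIndex: check which collation SQLite will apply
** to constraint i before agreeing to handle it natively */
const char *zColl = sqlite3_vtab_collation(pIdxInfo, i);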

Collation function example:

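int native_256_collation(void *pArg, int l1, const void *v1,
                         int l2, const void *v2){
  /* Convert both to native 256-ASCII first. fromUTF8(in, nIn, out) is
  ** assumed to return the native length, which never exceeds nIn. */
  unsigned char *s1 = sqlite3_malloc(l1+1), *s2 = sqlite3_malloc(l2+1);
  int rc = 0;
  if( s1 && s2 ){
    int n1 = fromUTF8(v1, l1, s1), n2 = fromUTF8(v2, l2, s2);
    rc = memcmp(s1, s2, n1<n2 ? n1 : n2);
    if( rc==0 ) rc = n1 - n2;  /* shorter string sorts first */
  }
  sqlite3_free(s1); sqlite3_free(s2);
  return rc;
}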

Step 4: Virtual Table Index Constraint Handling

Enhance the virtual table’s constraint absorption logic:

Detect LIKE patterns in xFilter:

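/* argv[] holds the values for constraints given an argvIndex in xBestIndex */
if( argc > 0 && sqlite3_value_type(argv[0]) == SQLITE_TEXT ){
  parse_like_pattern(sqlite3_value_text(argv[0]),
                     sqlite3_value_bytes(argv[0]));
}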

Handle overflow-safe ranges:

if( upper_bound_char == 0xFF ){
  /* Use >= constraint instead of a two-sided range */
  pCursor->hasUpperBound = 0;  /* cursor field assumed for illustration */
}

Adjust index scan boundaries:

if( pCursor->eSearchOp == SQLITE_INDEX_CONSTRAINT_LIKE ){
  /* Clamp upper bound to 0xFF */
  nativeUpper = min(nativeUpper, 0xFF);
}
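
One way to avoid computing an upper bound at all is to scan from the lower bound and stop at the first key that no longer carries the prefix. The sketch below works on a sorted in-memory array of NUL-terminated native strings as a stand-in for the virtual table's real index; native_cmp and prefix_scan are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Byte-wise comparison of two length-delimited native strings. */
static int native_cmp(const unsigned char *a, size_t na,
                      const unsigned char *b, size_t nb)
{
    int rc = memcmp(a, b, na < nb ? na : nb);
    if (rc != 0) return rc;
    return (na > nb) - (na < nb);
}

/* Count keys in a sorted array of NUL-terminated native strings that start
 * with `prefix`. The scan begins at the first key >= prefix (binary search)
 * and stops at the first key without the prefix, so it needs no explicit
 * upper bound and behaves the same when the prefix ends in 0xFF. */
static size_t prefix_scan(const char *const *keys, size_t nkeys,
                          const unsigned char *prefix, size_t plen)
{
    size_t lo = 0, hi = nkeys, count = 0;
    while (lo < hi) {                       /* lower-bound binary search */
        size_t mid = lo + (hi - lo) / 2;
        const unsigned char *k = (const unsigned char *)keys[mid];
        if (native_cmp(k, strlen(keys[mid]), prefix, plen) < 0) lo = mid + 1;
        else hi = mid;
    }
    for (; lo < nkeys; lo++) {
        if (strlen(keys[lo]) < plen || memcmp(keys[lo], prefix, plen) != 0) break;
        count++;
    }
    return count;
}
```

Because the stop condition is a prefix test rather than a comparison against an incremented bound, the 0xFF case needs no special handling in the scan itself.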

Step 5: SQLite Compile-Time Options

For permanent solutions, consider custom SQLite builds with:

Modified range optimization (sqlite3.c changes):

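/* Illustrative pseudocode only. The LIKE range terms are generated in
** whereexpr.c (isLikeOrGlob() and exprAnalyze()), not in wherecode.c */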
if( pTop->eOperator & WO_LT ){
  if( is_max_char(pRight) ){
    pTop->eOperator = WO_LE;
    pTop->pExpr->op = TK_LE;
  }
}

8-bit Text Mode (experimental; SQLITE_8BIT_TEXT is not a stock SQLite option and would have to be introduced by the custom build):

#ifdef SQLITE_8BIT_TEXT
# define SQLITE_SKIP_UTF8 1
/* Override text encoding functions */
#endif

Step 6: Query Pattern Rewriting

Intercept queries before execution to rewrite problematic LIKE patterns:

-- Original
SELECT * FROM t WHERE descr LIKE '\xff%';

-- Rewritten: lower bound only, because no native string sorts above the 0xFF prefix
SELECT * FROM t WHERE descr >= '\xff';

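SQLite provides no hook for rewriting a statement’s SQL text (sqlite3_preupdate_hook only reports row changes), so implement this through client-side query parsing before the statement is prepared.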
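
A client-side rewriter might build the replacement predicate with bound parameters rather than splicing raw bytes into the SQL text. In this sketch rewrite_like is a hypothetical helper, ?1 is assumed to be bound to the prefix and ?2 to the prefix with its last byte incremented, and the column name is assumed to come from trusted code, not from user input.

```c
#include <stdio.h>
#include <stddef.h>

/* Build a WHERE fragment for "<column> LIKE '<prefix>%'" using bound
 * parameters: ?1 is the prefix and ?2 its upper bound (the prefix with its
 * last byte incremented). When the prefix ends in 0xFF only the lower bound
 * is emitted. Returns the snprintf result, or -1 for an empty prefix. */
static int rewrite_like(char *out, size_t outsz, const char *column,
                        const unsigned char *prefix, size_t plen)
{
    if (plen == 0) return -1;
    if (prefix[plen - 1] == 0xFF) {
        return snprintf(out, outsz, "%s >= ?1", column);
    }
    return snprintf(out, outsz, "%s >= ?1 AND %s < ?2", column, column);
}
```

The open-ended form is only equivalent to the original LIKE when the column is compared in native byte order, so it depends on the collation work described in Step 3.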

Step 7: Virtual Table Storage Optimization

Modify the virtual table to store text in both native and UTF-8 formats:

CREATE TABLE shadow_cols (
  native BLOB,  -- 256-ASCII bytes
  utf8 TEXT,    -- Converted UTF-8
  descr_col_virtual ...
);

Use triggers to keep both representations synchronized, allowing optimized searches on the native column while maintaining UTF-8 compatibility.

Final Implementation Checklist

Conversion Validation:
- 100% round-trip fidelity for all 256 characters
- Benchmark conversion functions for speed

Collation Testing:
- Verify sort order matches native application
- Test edge cases (0x00, 0xFF, mixed case)

Virtual Table Modifications:
- Updated xBestIndex constraint handling
- LIKE pattern detection logic
- Overflow-safe range clamping

Query Optimization:
- EXPLAIN QUERY PLAN verification
- Index usage confirmation
- Performance profiling

Fallback Mechanisms:
- Automatic query rewriting
- Conversion error logging
- Legacy mode switches

By systematically addressing conversion fidelity, range calculation overflow, and collation sequence alignment, developers can keep the benefits of SQLite’s LIKE optimization while preserving compatibility with legacy 256-character ASCII systems. The solution requires careful coordination between the virtual table implementation and SQLite’s query optimization behavior, but it allows modern SQL features to be used with legacy encoding systems.
