Custom Collation in SQLite Fails with Multi-Column ORDER BY


Collation Behavior Discrepancy Between Single/Multi-Column Sorting and Concatenation

Key Symptoms and Observed Behavior

The core issue arises when implementing a custom collation sequence (ascii7_nocase_nospace) designed to:

  1. Perform case-insensitive comparisons
  2. Ignore accents
  3. Remove spaces during sorting

While the collation works correctly for single-column ORDER BY operations and concatenated column sorting, it produces unexpected results when sorting by multiple columns independently. Specific manifestations include:

  1. Tie-breaking failure in multi-column sorts:

    ORDER BY lname COLLATE ascii7_nocase_nospace, fname COLLATE ascii7_nocase_nospace
    

    Produces identical ordering to single-column sorts on lname, ignoring fname comparisons for equivalent lname values

  2. Correct behavior only when using concatenated columns:

    ORDER BY (lname || fname) COLLATE ascii7_nocase_nospace
    

    Achieves the desired lexicographical order across both columns

  3. Inconsistent equivalence detection between collation variants (e.g., Van den Berghe vs Vandenberghe vs vanden Berghe), particularly when spaces are present in different positions


Collation Implementation Flaws and Sorting Logic Misalignment

Primary Culprits in Custom Collation Code

  1. Non-terminated transformed strings
    Initial implementation of space removal failed to add null terminators to processed strings, causing strcmp() comparisons to read beyond intended buffers

  2. Buffer size mismatches

    char pOriginalText1[nKey1 + 1];  // Uses nKey1 for both buffers
    char pOriginalText2[nKey1 + 1]; // Should be nKey2 for second buffer
    

    Memory allocation errors when handling different-length inputs

  3. Collation sequence contract violations

    • Failure to maintain transitive property: If A=B and B=C, then A=C
    • Inconsistent comparison symmetry: (A < B) ≠ (B > A) in edge cases
    • Length-based tie-breaking contaminating equivalence detection
  4. Collation scope misunderstanding
    Assuming implicit collation inheritance across columns:

    ORDER BY col1, col2 COLLATE custom  -- Only applies to col2!
    

Sorting Algorithm Limitations

  1. Multi-column vs concatenated sort divergence

    • Column-wise: Compares col1 values first, uses col2 only for identical col1
    • Concatenated: Treats col1||col2 as single strings, enabling cross-column character comparisons
  2. Space removal altering sort hierarchy

    Original: "Van den Berghe" vs "Vandenberghe"  
    Processed: "Vandenberghe" vs "Vandenberghe" (equivalent)  
    --> Secondary sort on fname expected but not executed  
    

Collation Debugging Methodology and Permanent Fix Implementation

Step 1: Validate Collation Contract Compliance

Execute symmetry and transitivity tests:

-- Symmetry check
SELECT a.rowid, b.rowid 
FROM names a, names b 
WHERE a.rowid < b.rowid 
  AND (a.lname < b.lname COLLATE ascii7_nocase_nospace) 
  != (b.lname > a.lname COLLATE ascii7_nocase_nospace);

-- Transitivity check
WITH Triplets AS (
  SELECT a.lname AS x, b.lname AS y, c.lname AS z
  FROM names a, names b, names c
  WHERE a.rowid != b.rowid AND b.rowid != c.rowid
)
SELECT x, y, z
FROM Triplets
WHERE x = y COLLATE ascii7_nocase_nospace
  AND y = z COLLATE ascii7_nocase_nospace
  AND x != z COLLATE ascii7_nocase_nospace;

Step 2: Instrument Collation Function

Add debug logging to track string transformations:

static int ascii7Collate(void *data, int nKey1, const void *pKey1,
                         int nKey2, const void *pKey2) {
  char *norm1 = normalize(pKey1, nKey1);
  char *norm2 = normalize(pKey2, nKey2);
  
  log("Comparing '%s'(%d) vs '%s'(%d)", norm1, nKey1, norm2, nKey2);
  
  int result = strcmp(norm1, norm2);
  free(norm1); free(norm2);
  return result;
}

Step 3: Fix Memory Handling

Ensure proper buffer allocation and termination:

char* remove_spaces(const char* input, int length) {
  char* output = malloc(length + 1);
  int j = 0;
  for (int i = 0; i < length; i++) {
    if (input[i] != ' ') output[j++] = tolower(input[i]);
  }
  output[j] = '\0'; // Critical null terminator
  return output;
}

Step 4: Implement Multi-Level Sorting

Modify collation to support tie-breaking:

static int ascii7Collate(void *data, int nKey1, const void *pKey1,
                         int nKey2, const void *pKey2) {
  // Primary comparison: normalized content
  int content_cmp = compare_normalized(pKey1, nKey1, pKey2, nKey2);
  
  if (content_cmp != 0) return content_cmp;
  
  // Secondary tie-breaker: original length
  int len_diff = nKey1 - nKey2;
  return (len_diff > 0) ? 1 : (len_diff < 0) ? -1 : 0;
}

Step 5: Explicit Collation Application

Modify queries to apply collation correctly:

-- Before (incorrect)
SELECT * FROM names 
ORDER BY lname COLLATE ascii7_nocase_nospace, fname;

-- After (correct)
SELECT * FROM names 
ORDER BY 
  lname COLLATE ascii7_nocase_nospace,
  fname COLLATE ascii7_nocase_nospace;

Step 6: Benchmark Against Built-in Collations

Verify behavior parity using NOCASE:

EXPLAIN QUERY PLAN
SELECT rowid FROM names
ORDER BY lname COLLATE NOCASE, fname COLLATE NOCASE;

EXPLAIN QUERY PLAN 
SELECT rowid FROM names
ORDER BY (lname || fname) COLLATE NOCASE;

Permanent Fix Checklist

  1. Null termination of all processed strings
  2. Buffer size validation for unequal input lengths
  3. Secondary sort criteria implementation in collation
  4. Explicit collation application to each sorted column
  5. Automated collation validation using unit test queries

By methodically addressing buffer management, collation contract requirements, and explicit sorting scope declaration, the custom collation achieves consistent behavior across both single-column and multi-column sorting scenarios. The critical realization lies in understanding that SQLite applies collations per-column in ORDER BY clauses, requiring explicit designation for each sorted expression. Combined with rigorous memory safety practices in custom collation implementations, this resolves the observed discrepancies between concatenated and multi-column sorting approaches.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *