Custom Collation in SQLite Fails with Multi-Column ORDER BY
Collation Behavior Discrepancy Between Single/Multi-Column Sorting and Concatenation
Key Symptoms and Observed Behavior
The core issue arises when implementing a custom collation sequence (ascii7_nocase_nospace
) designed to:
- Perform case-insensitive comparisons
- Ignore accents
- Remove spaces during sorting
While the collation works correctly for single-column ORDER BY
operations and concatenated column sorting, it produces unexpected results when sorting by multiple columns independently. Specific manifestations include:
Tie-breaking failure in multi-column sorts:
ORDER BY lname COLLATE ascii7_nocase_nospace, fname COLLATE ascii7_nocase_nospace
Produces identical ordering to single-column sorts on
lname
, ignoringfname
comparisons for equivalentlname
valuesCorrect behavior only when using concatenated columns:
ORDER BY (lname || fname) COLLATE ascii7_nocase_nospace
Achieves the desired lexicographical order across both columns
Inconsistent equivalence detection between collation variants (e.g.,
Van den Berghe
vsVandenberghe
vsvanden Berghe
), particularly when spaces are present in different positions
Collation Implementation Flaws and Sorting Logic Misalignment
Primary Culprits in Custom Collation Code
Non-terminated transformed strings
Initial implementation of space removal failed to add null terminators to processed strings, causingstrcmp()
comparisons to read beyond intended buffersBuffer size mismatches
char pOriginalText1[nKey1 + 1]; // Uses nKey1 for both buffers char pOriginalText2[nKey1 + 1]; // Should be nKey2 for second buffer
Memory allocation errors when handling different-length inputs
Collation sequence contract violations
- Failure to maintain transitive property: If A=B and B=C, then A=C
- Inconsistent comparison symmetry: (A < B) ≠ (B > A) in edge cases
- Length-based tie-breaking contaminating equivalence detection
Collation scope misunderstanding
Assuming implicit collation inheritance across columns:ORDER BY col1, col2 COLLATE custom -- Only applies to col2!
Sorting Algorithm Limitations
Multi-column vs concatenated sort divergence
- Column-wise: Compares
col1
values first, usescol2
only for identicalcol1
- Concatenated: Treats
col1||col2
as single strings, enabling cross-column character comparisons
- Column-wise: Compares
Space removal altering sort hierarchy
Original: "Van den Berghe" vs "Vandenberghe" Processed: "Vandenberghe" vs "Vandenberghe" (equivalent) --> Secondary sort on fname expected but not executed
Collation Debugging Methodology and Permanent Fix Implementation
Step 1: Validate Collation Contract Compliance
Execute symmetry and transitivity tests:
-- Symmetry check
SELECT a.rowid, b.rowid
FROM names a, names b
WHERE a.rowid < b.rowid
AND (a.lname < b.lname COLLATE ascii7_nocase_nospace)
!= (b.lname > a.lname COLLATE ascii7_nocase_nospace);
-- Transitivity check
WITH Triplets AS (
SELECT a.lname AS x, b.lname AS y, c.lname AS z
FROM names a, names b, names c
WHERE a.rowid != b.rowid AND b.rowid != c.rowid
)
SELECT x, y, z
FROM Triplets
WHERE x = y COLLATE ascii7_nocase_nospace
AND y = z COLLATE ascii7_nocase_nospace
AND x != z COLLATE ascii7_nocase_nospace;
Step 2: Instrument Collation Function
Add debug logging to track string transformations:
static int ascii7Collate(void *data, int nKey1, const void *pKey1,
int nKey2, const void *pKey2) {
char *norm1 = normalize(pKey1, nKey1);
char *norm2 = normalize(pKey2, nKey2);
log("Comparing '%s'(%d) vs '%s'(%d)", norm1, nKey1, norm2, nKey2);
int result = strcmp(norm1, norm2);
free(norm1); free(norm2);
return result;
}
Step 3: Fix Memory Handling
Ensure proper buffer allocation and termination:
char* remove_spaces(const char* input, int length) {
char* output = malloc(length + 1);
int j = 0;
for (int i = 0; i < length; i++) {
if (input[i] != ' ') output[j++] = tolower(input[i]);
}
output[j] = '\0'; // Critical null terminator
return output;
}
Step 4: Implement Multi-Level Sorting
Modify collation to support tie-breaking:
static int ascii7Collate(void *data, int nKey1, const void *pKey1,
int nKey2, const void *pKey2) {
// Primary comparison: normalized content
int content_cmp = compare_normalized(pKey1, nKey1, pKey2, nKey2);
if (content_cmp != 0) return content_cmp;
// Secondary tie-breaker: original length
int len_diff = nKey1 - nKey2;
return (len_diff > 0) ? 1 : (len_diff < 0) ? -1 : 0;
}
Step 5: Explicit Collation Application
Modify queries to apply collation correctly:
-- Before (incorrect)
SELECT * FROM names
ORDER BY lname COLLATE ascii7_nocase_nospace, fname;
-- After (correct)
SELECT * FROM names
ORDER BY
lname COLLATE ascii7_nocase_nospace,
fname COLLATE ascii7_nocase_nospace;
Step 6: Benchmark Against Built-in Collations
Verify behavior parity using NOCASE:
EXPLAIN QUERY PLAN
SELECT rowid FROM names
ORDER BY lname COLLATE NOCASE, fname COLLATE NOCASE;
EXPLAIN QUERY PLAN
SELECT rowid FROM names
ORDER BY (lname || fname) COLLATE NOCASE;
Permanent Fix Checklist
- Null termination of all processed strings
- Buffer size validation for unequal input lengths
- Secondary sort criteria implementation in collation
- Explicit collation application to each sorted column
- Automated collation validation using unit test queries
By methodically addressing buffer management, collation contract requirements, and explicit sorting scope declaration, the custom collation achieves consistent behavior across both single-column and multi-column sorting scenarios. The critical realization lies in understanding that SQLite applies collations per-column in ORDER BY
clauses, requiring explicit designation for each sorted expression. Combined with rigorous memory safety practices in custom collation implementations, this resolves the observed discrepancies between concatenated and multi-column sorting approaches.