SQLite CLI UTF-16 Console I/O: Chinese Character Misalignment & Input Handling Issues

Understanding SQLite CLI’s Box Mode Rendering Errors and Input Processing Anomalies

Issue Overview
The core challenges reported in the discussion revolve around the SQLite Command-Line Interface (CLI) beta builds (specifically sqlite3_m.exe and sqlite3_l.exe) when handling Unicode characters, particularly Chinese text, in Windows environments. Two primary issues emerge:

  1. Box Mode Rendering Misalignment with Chinese Characters
    When using .mode box to format query results, Chinese characters (e.g., 版本 6.1.7601) introduce leading spaces that disrupt table alignment. This misalignment persists across console hosts like cmd.exe, ConEmu, and Far Manager, even when UTF-8 code pages (chcp 65001) are active.

  2. Input Handling Inconsistencies

    • The legacy build (sqlite3_l.exe) exits silently or fails to process input after switching to UTF-8 mode (chcp 65001).
    • Pasting multi-line SQL statements into the CLI occasionally triggers syntax errors, particularly when the pasted content includes extraneous text (e.g., sqlite3_m.exe).
    • Alt-code entry discrepancies (e.g., ALT-9794 for ♂) between console hosts, with some characters rendering correctly in Far Manager but not in cmd.exe.

Root Causes of Rendering and Input Failures

  1. Unicode Width Calculation Errors in Box Mode
    The SQLite CLI’s box-drawing logic relies on ANSI-era character width assumptions, where each printable character is assumed to occupy 1 column. However, Unicode characters from East Asian languages (e.g., Chinese, Japanese) often use full-width glyphs that occupy 2 columns. The CLI’s failure to adjust for this discrepancy leads to misaligned tables.

    Example:
    The string '版本 6.1.7601' contains four Chinese characters (, , 6, ., etc.). Each Chinese ideograph is a full-width character (2 columns), while ASCII characters (digits, dots) are half-width (1 column). The CLI incorrectly sums their widths, causing the leading space and misalignment.

  2. MBCS-to-UTF-16 Translation Flaws in Legacy Builds
    The sqlite3_l.exe (legacy) build uses Multi-Byte Character Set (MBCS) encoding, which struggles with UTF-8 conversions on consoles configured for UTF-16 output. When chcp 65001 is set:

    • Input buffers may be truncated or misinterpreted due to incomplete MBCS sequences.
    • The console’s UTF-8 input mode conflicts with the CLI’s MBCS-to-UTF-16 translation layer, causing silent exits or unprocessed input.
  3. Terminal Host-Specific Alt-Code Handling
    Alt-code entry behavior varies across console hosts:

    • Far Manager and modern terminals (e.g., Windows Terminal) directly map Alt-codes to Unicode code points.
    • Legacy cmd.exe relies on OEM code pages, translating Alt-codes (e.g., ALT-9794) into ANSI characters, which may not correspond to valid Unicode glyphs.
  4. Paste Buffer Corruption in Console Hosts
    Pasting multi-line SQL into the CLI may introduce hidden artifacts (e.g., line breaks, terminal escape sequences) depending on the console’s clipboard handling. For example:

    CREATE VIEW t3(x) AS SELECT 0;
    sqlite3_m.exe ;  -- Accidental inclusion of executable name
    

    The rogue sqlite3_m.exe text is parsed as part of the SQL input, triggering syntax errors.

Resolving Rendering, Input, and Compatibility Issues

1. Correcting Box Mode Alignment for East Asian Characters

Step 1: Audit Unicode Width Calculation Logic
The SQLite CLI’s table formatting code (in shell.c, printf functions) must integrate Unicode East Asian Width (EAW) properties. This involves:

  • Using the wcwidth() function (or equivalent) to determine column widths for Unicode characters.
  • For Windows builds, implement a lookup table for EAW categories (e.g., "Wide" = 2 columns, "Narrow" = 1 column).

Step 2: Patch the Box Drawing Algorithm
Modify the box rendering logic to account for variable-width characters:

// Example adjustment in shell.c (simplified)
int calculate_display_width(const char *text) {
  int width = 0;
  for (const char *p = text; *p; p++) {
    if (is_utf8_continuation(*p)) continue;
    uint32_t code_point = decode_utf8(&p);
    width += (uc_is_wide(code_point) ? 2 : 1);
  }
  return width;
}

This ensures that full-width characters contribute 2 columns to the table layout.

Step 3: Test with Diverse Unicode Inputs
Validate alignment using mixed scripts:

.mode box
SELECT '中文' AS a, 'test' AS b, '日本' AS c;

Expected output:

┌────┬──────┬────┐
│ a  │  b  │ c  │
├────┼──────┼────┤
│ 中文 │ test │ 日本 │
└────┴──────┴────┘

2. Fixing Legacy Build Input Failures in UTF-8 Mode

Step 1: Diagnose MBCS-to-UTF-16 Translation
The sqlite3_l.exe build uses SetConsoleCP(65001) to enable UTF-8 input. However, Windows’ MBCS layer has known bugs with UTF-8 code pages. To mitigate:

  • Replace MBCS I/O with direct UTF-16 APIs (ReadConsoleW, WriteConsoleW) even in the legacy build.
  • Deprecate MBCS-only code paths for Windows versions post-Windows 7.

Step 2: Handle Incomplete Byte Sequences
Modify input buffering to detect and retain partial UTF-8 sequences across fread() calls:

// In console_read() (shell.c):
static char input_buffer[4096];
static size_t buffer_pos = 0;

size_t bytes_read = fread(input_buffer + buffer_pos, 1, sizeof(input_buffer) - buffer_pos, stdin);
buffer_pos += bytes_read;

// Process complete UTF-8 sequences, leave incomplete bytes in buffer
size_t valid_bytes = utf8_valid_up_to(input_buffer, buffer_pos);
process_input(input_buffer, valid_bytes);
memmove(input_buffer, input_buffer + valid_bytes, buffer_pos - valid_bytes);
buffer_pos -= valid_bytes;

Step 3: Validate with Chinese Input
Test sqlite3_l.exe under chcp 65001:

SELECT '版本 6.1.7601' AS test;

Expected output (no silent exit):

test
----------
版本 6.1.7601

3. Addressing Alt-Code and Console Host Discrepancies

Step 1: Standardize Alt-Code Handling
Map Alt-code entries to Unicode code points using VkKeyScanW and ToUnicode Win32 APIs, bypassing OEM code pages:

// In console keyboard input handler:
wchar_t buffer[4];
int count = ToUnicode(vk, scan_code, keyboard_state, buffer, 4, 0);
if (count > 0) {
  // Append buffer to input stream
}

Step 2: Console Host Workarounds
Document known differences between terminals:

  • Far Manager: Direct Unicode input via Alt-codes works.
  • cmd.exe: Use chcp 65001 and hexadecimal Alt-codes (e.g., ALT+026U+001A).

4. Mitigating Paste-Induced Syntax Errors

Step 1: Sanitize Clipboard Input
Strip non-SQL text (e.g., sqlite3_m.exe) from pasted content using a regex filter:

// In paste handler:
const char *sql_pattern = "(?:^|\\n)(sqlite3[_.].*?\\n)";
remove_matches(input, sql_pattern);

Step 2: Improve Error Messaging
When a syntax error occurs, highlight the exact position and context:

Parse error near "sqlite3_m" (column 1):
  sqlite3_m.exe ;
  ^-- Expected SQL keyword

Step 3: Validate with Problematic Pastes
Test pasting the problematic SQL snippet:

CREATE VIEW t3(x) AS SELECT 0;
sqlite3_m.exe ;

Expected outcome:

  • sqlite3_m.exe is ignored or triggers a clear error message.

Compatibility Workarounds for End Users

  1. Use UTF-16 Builds (sqlite3_m.exe or sqlite3_x64.exe)
    Prefer these over legacy builds for consistent Unicode handling.

  2. Avoid chcp 65001 with Legacy Builds
    Stick to default code pages (e.g., chcp 936 for Chinese) when using sqlite3_l.exe.

  3. Alternative Box Modes
    Use .mode markdown or .mode json as temporary workarounds for alignment issues.

  4. Terminal Recommendations
    Modern consoles like Windows Terminal or MSYS2 provide better Unicode support and fewer Alt-code quirks.

Conclusion
The SQLite CLI’s beta console I/O improvements mark a significant step toward robust Unicode support. However, addressing East Asian character widths, standardizing input handling across console hosts, and enhancing error resilience will solidify its reliability for global users. Developers should prioritize integrating Unicode-aware width calculations and deprecating MBCS code paths in favor of UTF-16/Win32 APIs. End users can mitigate issues by adopting modern CLI builds and terminals while avoiding legacy configurations prone to encoding conflicts.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *