SQLite CLI UTF-16 Console I/O: Chinese Character Misalignment & Input Handling Issues
Understanding SQLite CLI’s Box Mode Rendering Errors and Input Processing Anomalies
Issue Overview
The core challenges reported in the discussion revolve around the SQLite Command-Line Interface (CLI) beta builds (specifically sqlite3_m.exe
and sqlite3_l.exe
) when handling Unicode characters, particularly Chinese text, in Windows environments. Two primary issues emerge:
Box Mode Rendering Misalignment with Chinese Characters
When using.mode box
to format query results, Chinese characters (e.g.,版本 6.1.7601
) introduce leading spaces that disrupt table alignment. This misalignment persists across console hosts likecmd.exe
,ConEmu
, andFar Manager
, even when UTF-8 code pages (chcp 65001
) are active.Input Handling Inconsistencies
- The legacy build (
sqlite3_l.exe
) exits silently or fails to process input after switching to UTF-8 mode (chcp 65001
). - Pasting multi-line SQL statements into the CLI occasionally triggers syntax errors, particularly when the pasted content includes extraneous text (e.g.,
sqlite3_m.exe
). - Alt-code entry discrepancies (e.g.,
ALT-9794
for ♂) between console hosts, with some characters rendering correctly inFar Manager
but not incmd.exe
.
- The legacy build (
Root Causes of Rendering and Input Failures
Unicode Width Calculation Errors in Box Mode
The SQLite CLI’s box-drawing logic relies on ANSI-era character width assumptions, where each printable character is assumed to occupy 1 column. However, Unicode characters from East Asian languages (e.g., Chinese, Japanese) often use full-width glyphs that occupy 2 columns. The CLI’s failure to adjust for this discrepancy leads to misaligned tables.Example:
The string'版本 6.1.7601'
contains four Chinese characters (版
,本
,6
,.
, etc.). Each Chinese ideograph is a full-width character (2 columns), while ASCII characters (digits, dots) are half-width (1 column). The CLI incorrectly sums their widths, causing the leading space and misalignment.MBCS-to-UTF-16 Translation Flaws in Legacy Builds
Thesqlite3_l.exe
(legacy) build uses Multi-Byte Character Set (MBCS) encoding, which struggles with UTF-8 conversions on consoles configured for UTF-16 output. Whenchcp 65001
is set:- Input buffers may be truncated or misinterpreted due to incomplete MBCS sequences.
- The console’s UTF-8 input mode conflicts with the CLI’s MBCS-to-UTF-16 translation layer, causing silent exits or unprocessed input.
Terminal Host-Specific Alt-Code Handling
Alt-code entry behavior varies across console hosts:Far Manager
and modern terminals (e.g., Windows Terminal) directly map Alt-codes to Unicode code points.- Legacy
cmd.exe
relies on OEM code pages, translating Alt-codes (e.g.,ALT-9794
) into ANSI characters, which may not correspond to valid Unicode glyphs.
Paste Buffer Corruption in Console Hosts
Pasting multi-line SQL into the CLI may introduce hidden artifacts (e.g., line breaks, terminal escape sequences) depending on the console’s clipboard handling. For example:CREATE VIEW t3(x) AS SELECT 0; sqlite3_m.exe ; -- Accidental inclusion of executable name
The rogue
sqlite3_m.exe
text is parsed as part of the SQL input, triggering syntax errors.
Resolving Rendering, Input, and Compatibility Issues
1. Correcting Box Mode Alignment for East Asian Characters
Step 1: Audit Unicode Width Calculation Logic
The SQLite CLI’s table formatting code (in shell.c
, printf
functions) must integrate Unicode East Asian Width (EAW) properties. This involves:
- Using the
wcwidth()
function (or equivalent) to determine column widths for Unicode characters. - For Windows builds, implement a lookup table for EAW categories (e.g., "Wide" = 2 columns, "Narrow" = 1 column).
Step 2: Patch the Box Drawing Algorithm
Modify the box rendering
logic to account for variable-width characters:
// Example adjustment in shell.c (simplified)
int calculate_display_width(const char *text) {
int width = 0;
for (const char *p = text; *p; p++) {
if (is_utf8_continuation(*p)) continue;
uint32_t code_point = decode_utf8(&p);
width += (uc_is_wide(code_point) ? 2 : 1);
}
return width;
}
This ensures that full-width characters contribute 2 columns to the table layout.
Step 3: Test with Diverse Unicode Inputs
Validate alignment using mixed scripts:
.mode box
SELECT '中文' AS a, 'test' AS b, '日本' AS c;
Expected output:
┌────┬──────┬────┐
│ a │ b │ c │
├────┼──────┼────┤
│ 中文 │ test │ 日本 │
└────┴──────┴────┘
2. Fixing Legacy Build Input Failures in UTF-8 Mode
Step 1: Diagnose MBCS-to-UTF-16 Translation
The sqlite3_l.exe
build uses SetConsoleCP(65001)
to enable UTF-8 input. However, Windows’ MBCS layer has known bugs with UTF-8 code pages. To mitigate:
- Replace MBCS I/O with direct UTF-16 APIs (
ReadConsoleW
,WriteConsoleW
) even in the legacy build. - Deprecate MBCS-only code paths for Windows versions post-Windows 7.
Step 2: Handle Incomplete Byte Sequences
Modify input buffering to detect and retain partial UTF-8 sequences across fread()
calls:
// In console_read() (shell.c):
static char input_buffer[4096];
static size_t buffer_pos = 0;
size_t bytes_read = fread(input_buffer + buffer_pos, 1, sizeof(input_buffer) - buffer_pos, stdin);
buffer_pos += bytes_read;
// Process complete UTF-8 sequences, leave incomplete bytes in buffer
size_t valid_bytes = utf8_valid_up_to(input_buffer, buffer_pos);
process_input(input_buffer, valid_bytes);
memmove(input_buffer, input_buffer + valid_bytes, buffer_pos - valid_bytes);
buffer_pos -= valid_bytes;
Step 3: Validate with Chinese Input
Test sqlite3_l.exe
under chcp 65001
:
SELECT '版本 6.1.7601' AS test;
Expected output (no silent exit):
test
----------
版本 6.1.7601
3. Addressing Alt-Code and Console Host Discrepancies
Step 1: Standardize Alt-Code Handling
Map Alt-code entries to Unicode code points using VkKeyScanW
and ToUnicode
Win32 APIs, bypassing OEM code pages:
// In console keyboard input handler:
wchar_t buffer[4];
int count = ToUnicode(vk, scan_code, keyboard_state, buffer, 4, 0);
if (count > 0) {
// Append buffer to input stream
}
Step 2: Console Host Workarounds
Document known differences between terminals:
- Far Manager: Direct Unicode input via Alt-codes works.
- cmd.exe: Use
chcp 65001
and hexadecimal Alt-codes (e.g.,ALT+026
→U+001A
).
4. Mitigating Paste-Induced Syntax Errors
Step 1: Sanitize Clipboard Input
Strip non-SQL text (e.g., sqlite3_m.exe
) from pasted content using a regex filter:
// In paste handler:
const char *sql_pattern = "(?:^|\\n)(sqlite3[_.].*?\\n)";
remove_matches(input, sql_pattern);
Step 2: Improve Error Messaging
When a syntax error occurs, highlight the exact position and context:
Parse error near "sqlite3_m" (column 1):
sqlite3_m.exe ;
^-- Expected SQL keyword
Step 3: Validate with Problematic Pastes
Test pasting the problematic SQL snippet:
CREATE VIEW t3(x) AS SELECT 0;
sqlite3_m.exe ;
Expected outcome:
sqlite3_m.exe
is ignored or triggers a clear error message.
Compatibility Workarounds for End Users
Use UTF-16 Builds (
sqlite3_m.exe
orsqlite3_x64.exe
)
Prefer these over legacy builds for consistent Unicode handling.Avoid
chcp 65001
with Legacy Builds
Stick to default code pages (e.g.,chcp 936
for Chinese) when usingsqlite3_l.exe
.Alternative Box Modes
Use.mode markdown
or.mode json
as temporary workarounds for alignment issues.Terminal Recommendations
Modern consoles like Windows Terminal or MSYS2 provide better Unicode support and fewer Alt-code quirks.
Conclusion
The SQLite CLI’s beta console I/O improvements mark a significant step toward robust Unicode support. However, addressing East Asian character widths, standardizing input handling across console hosts, and enhancing error resilience will solidify its reliability for global users. Developers should prioritize integrating Unicode-aware width calculations and deprecating MBCS code paths in favor of UTF-16/Win32 APIs. End users can mitigate issues by adopting modern CLI builds and terminals while avoiding legacy configurations prone to encoding conflicts.