Resolving CSV-to-Relational Mapping and Normalization Errors in SQLite Many-to-Many Designs
Structural Flaws in CSV-to-Relational Data Parsing and Normalization
Issue Overview: Faulty Schema Design and Trigger-Based CSV Parsing
The core challenge involves converting comma-separated values (CSV) stored in a single column into a normalized relational structure using a many-to-many relationship. The original implementation uses a trigger to split CSV strings into separate rows via SQLite's json_each function, populating a junction table (t_x_codes) and a lookup table (codes); a reconstruction of such a trigger appears after the list of flaws below. While functional for small datasets, this approach introduces critical flaws:
Normalization Violations:
- The t table's codes column violates First Normal Form (1NF) by storing multiple values in a single field. This creates data redundancy and complicates atomicity guarantees.
- The codes table's existence is questionable: if codes are context-dependent (e.g., "dog" meaning different things for different t entries), storing them centrally is semantically incorrect. If codes are universal, the junction table (t_x_codes) redundantly duplicates code values instead of referencing surrogate keys.
Data Integrity Risks:
- The trigger uses INSERT OR REPLACE on the codes table, which may inadvertently overwrite existing entries if case sensitivity or contextual differences exist (e.g., "Foo" vs. "foo").
- No foreign key constraints enforce referential integrity between t_x_codes and codes, risking orphaned records.
Performance and Scalability:
- The AFTER INSERT trigger processes each new t row individually. For bulk inserts (hundreds of thousands of rows), this row-by-row processing becomes inefficient.
- The JSON-based CSV parsing method (json_each('["' || Replace(...) || '"]')) is fragile: it fails on CSV strings containing double quotes, backslashes, or other characters that require JSON escaping.
Schema Redundancy:
- The t table retains the original CSV string in the codes column, creating a synchronization burden. If codes are modified via the codes or t_x_codes tables, the t.codes field becomes outdated.
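For concreteness, the kind of trigger described above might look roughly like this (a reconstruction for illustration only, assuming the original schema: a t table with a codes CSV column, a codes lookup table, and a t_x_codes junction that stores code text):
-- Reconstructed (flawed) trigger: splits NEW.codes via naive JSON wrapping
CREATE TRIGGER t_ai AFTER INSERT ON t
BEGIN
    INSERT OR REPLACE INTO codes (code)           -- may clobber existing rows
    SELECT value
    FROM json_each('["' || REPLACE(NEW.codes, ',', '","') || '"]');
    INSERT INTO t_x_codes (tid, code)             -- stores code text, not a key
    SELECT NEW.tid, value
    FROM json_each('["' || REPLACE(NEW.codes, ',', '","') || '"]');
END;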
Root Causes: Misapplied Normalization and Fragile Parsing Logic
Misunderstanding Many-to-Many Relationships:
- The junction table (t_x_codes) directly stores code values instead of foreign keys to a normalized codes table. This conflates two distinct concepts:
  - Business entity: a code's semantic meaning (stored in codes).
  - Relationship: the association between a t entry and a code (stored in t_x_codes).
- If codes are immutable and globally unique, t_x_codes should reference the codes table via a foreign key (the solutions below use its surrogate key, cid). If codes are context-specific to t entries, the codes table is unnecessary.
Improper Use of Triggers for Data Transformation:
- Triggers are executed synchronously during row insertion. Parsing CSV strings within a trigger forces all downstream logic (code insertion, junction population) into a transactional context, increasing lock contention and slowing bulk operations.
- The JSON-based CSV splitting assumes ideal input formats. Real-world data often includes edge cases (e.g., "value, with \"internal\" quotes") that break naive string replacement, as demonstrated below.
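A quick demonstration of the failure mode (assumes SQLite with JSON support; the sample string is invented): an embedded double quote produces malformed JSON, so the whole statement errors out rather than splitting gracefully.
-- The naive wrap turns: say "hi",bye
-- into: ["say "hi"","bye"]   -- not valid JSON
SELECT value
FROM json_each('["' || REPLACE('say "hi",bye', ',', '","') || '"]');
-- Fails with: malformed JSON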
Lack of Data Validation and Constraints:
- No UNIQUE constraint on t.title allows duplicate titles with different tid values, complicating data retrieval.
- Case sensitivity in code comparisons (e.g., "Foo" vs. "foo") is unhandled, leading to duplicate entries in codes unless COLLATE NOCASE is applied (see the example after this list).
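With the codes definition from Step 1 below (code TEXT UNIQUE NOT NULL COLLATE NOCASE), case variants collapse into a single row; a minimal sketch:
INSERT OR IGNORE INTO codes (code) VALUES ('Foo');
INSERT OR IGNORE INTO codes (code) VALUES ('foo');  -- ignored: duplicate under NOCASE
SELECT COUNT(*) FROM codes WHERE code = 'FOO';      -- returns 1; comparison uses the column's collation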
Redundant Storage of Original CSV Data:
- Retaining the raw CSV string in t.codes violates normalization principles. Derived data should be computable from the normalized tables (e.g., via a VIEW).
Solutions: Schema Refactoring, Robust Parsing, and Batch Optimization
Step 1: Normalize the Schema Correctly
Option A: Universal Code Definitions
If codes have universal meanings:
-- Remove the t.codes column entirely; the CSV is derived via a VIEW instead
CREATE TABLE t (
    tid INTEGER PRIMARY KEY,
    title TEXT UNIQUE NOT NULL
);
-- Use surrogate keys for codes to minimize redundancy
CREATE TABLE codes (
    cid INTEGER PRIMARY KEY,
    code TEXT UNIQUE NOT NULL COLLATE NOCASE
);
-- Junction table references surrogate keys on both sides
CREATE TABLE t_x_codes (
    tid INTEGER NOT NULL REFERENCES t(tid),
    cid INTEGER NOT NULL REFERENCES codes(cid),
    PRIMARY KEY (tid, cid)
) WITHOUT ROWID;
-- Reconstruct the original CSV via aggregation (code order is not guaranteed)
CREATE VIEW t_with_codes AS
SELECT t.tid, t.title, GROUP_CONCAT(c.code, ',') AS codes
FROM t
LEFT JOIN t_x_codes tc ON t.tid = tc.tid
LEFT JOIN codes c ON tc.cid = c.cid
GROUP BY t.tid;
Option B: Context-Specific Codes
If codes are only meaningful within their parent t entry:
-- Eliminate codes table; store directly in junction
CREATE TABLE t_x_codes (
tid INTEGER NOT NULL REFERENCES t(tid),
code TEXT NOT NULL,
PRIMARY KEY (tid, code)
) WITHOUT ROWID;
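In this variant the same code text can legitimately appear under different t entries, while duplicates within one entry are rejected by the composite primary key. For instance (invented values; assumes rows with tid 1 and 2 exist in t):
INSERT INTO t_x_codes (tid, code) VALUES (1, 'dog');
INSERT INTO t_x_codes (tid, code) VALUES (2, 'dog');     -- fine: different parent
-- INSERT INTO t_x_codes (tid, code) VALUES (1, 'dog');  -- would violate PRIMARY KEY (tid, code)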
Step 2: Replace Trigger with Batch CSV Parsing
Avoid row-level triggers for bulk operations. Instead, parse CSVs during initial insertion using a temporary table:
-- Temporary staging table for raw data
CREATE TEMP TABLE t_staging (
title TEXT,
codes TEXT
);
-- Bulk insert raw data
INSERT INTO t_staging (title, codes)
VALUES
('first t', 'foo,bar,baz,qux'),
('second t', 'one,two,thr'),
('third t','red,grn,blu,blk,cyn'),
('fourth t', 'cat,dog,cow'),
('fifth t null code', NULL),
('sixth t blank code', '');
-- Insert into t, ignoring codes
INSERT INTO t (title)
SELECT title FROM t_staging;
-- Parse codes in bulk using a recursive CTE. A CTE is scoped to a single
-- statement, so materialize the result once in a temp table that both
-- inserts below can reference.
CREATE TEMP TABLE split_codes AS
WITH RECURSIVE split(tid, code, rest) AS (
    SELECT t.tid,
           SUBSTR(ts.codes, 1, INSTR(ts.codes || ',', ',') - 1),
           SUBSTR(ts.codes, INSTR(ts.codes || ',', ',') + 1)
    FROM t
    JOIN t_staging ts ON t.title = ts.title
    WHERE ts.codes IS NOT NULL AND ts.codes != ''
    UNION ALL
    SELECT tid,
           SUBSTR(rest, 1, INSTR(rest || ',', ',') - 1),
           SUBSTR(rest, INSTR(rest || ',', ',') + 1)
    FROM split
    WHERE rest != ''
)
SELECT tid, code FROM split WHERE code != '';
-- Insert the distinct codes
INSERT OR IGNORE INTO codes (code)
SELECT DISTINCT code FROM split_codes;
-- Populate junction table; OR IGNORE absorbs case-variant duplicates that
-- map to the same cid, and the explicit COLLATE makes the join case-insensitive
INSERT OR IGNORE INTO t_x_codes (tid, cid)
SELECT s.tid, c.cid
FROM split_codes s
JOIN codes c ON s.code = c.code COLLATE NOCASE;
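Assuming the Option A schema and view from Step 1, a quick spot-check confirms the load (expected values follow from the staging data above; GROUP_CONCAT order within each list is unspecified):
-- 'first t' should list foo,bar,baz,qux in some order;
-- the NULL-code and blank-code rows should show NULL codes
SELECT title, codes FROM t_with_codes ORDER BY tid;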
Step 3: Handle Edge Cases in CSV Parsing
Use a more robust splitting mechanism that tolerates double-quoted fields containing embedded commas:
-- Quote-aware splitter using a recursive CTE
CREATE TEMPORARY TABLE split_temp AS
WITH RECURSIVE splitter(tid, code, rest, depth) AS (
SELECT t.tid,
'',
ts.codes || ',',
0
FROM t
JOIN t_staging ts ON t.title = ts.title
WHERE ts.codes IS NOT NULL AND ts.codes != ''
UNION ALL
SELECT tid,
CASE
WHEN SUBSTR(rest, 1, 1) = '"' THEN
SUBSTR(rest, 2, INSTR(SUBSTR(rest, 2), '"') - 1)
ELSE
SUBSTR(rest, 1, INSTR(rest || ',', ',') - 1)
END,
CASE
WHEN SUBSTR(rest, 1, 1) = '"' THEN
SUBSTR(rest, INSTR(SUBSTR(rest, 2), '"') + 3)
ELSE
SUBSTR(rest, INSTR(rest || ',', ',') + 1)
END,
depth + 1
FROM splitter
WHERE rest != '' AND depth < 100 -- Prevent infinite recursion
)
SELECT tid, TRIM(code) AS code
FROM splitter
WHERE code != '' AND depth > 0;
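The rows in split_temp can then drive the same inserts as Step 2 (a sketch, assuming the Option A schema):
INSERT OR IGNORE INTO codes (code)
SELECT DISTINCT code FROM split_temp;
INSERT OR IGNORE INTO t_x_codes (tid, cid)
SELECT s.tid, c.cid
FROM split_temp s
JOIN codes c ON s.code = c.code COLLATE NOCASE;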
Step 4: Enforce Data Consistency
- Add foreign key constraints:
PRAGMA foreign_keys = ON; -- Enable FK enforcement (off by default; set per connection)
- Apply COLLATE NOCASE to code columns for case insensitivity.
- Remove the redundant t.codes column; use a VIEW to reconstruct the CSV (see t_with_codes in Step 1).
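With enforcement on, orphaned junction rows are rejected at insert time; for example (invented values, assuming no row with tid 999 exists in t):
PRAGMA foreign_keys = ON;
INSERT INTO t_x_codes (tid, cid) VALUES (999, 999);
-- Fails with: FOREIGN KEY constraint failed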
Step 5: Optimize Bulk Insert Performance
- Relax foreign key checks during bulk inserts. (SQLite has no pragma to disable triggers; drop and recreate them if needed. Also note that PRAGMA foreign_keys is a no-op while a transaction is open, so set it outside the transaction.)
PRAGMA foreign_keys = OFF;  -- or, inside a transaction: PRAGMA defer_foreign_keys = ON;
-- Perform bulk inserts
PRAGMA foreign_keys = ON;
- Use batched transactions:
BEGIN TRANSACTION;
-- Bulk insert logic
COMMIT;
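Combining the two, a bulk load might be wrapped as follows (a sketch; the elided statements stand in for the Step 2 pipeline):
BEGIN TRANSACTION;
PRAGMA defer_foreign_keys = ON;  -- FK checks postponed until COMMIT; resets automatically
INSERT INTO t (title) SELECT title FROM t_staging;
-- ...recursive-CTE parse and junction inserts from Step 2...
COMMIT;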
Step 6: Validate and Test
- Verify normalization:
-- Ensure no duplicate codes (GROUP BY uses the column's NOCASE collation)
SELECT code, COUNT(*) FROM codes GROUP BY code HAVING COUNT(*) > 1;
- Check for orphaned records:
-- Orphaned t_x_codes entries
SELECT * FROM t_x_codes WHERE tid NOT IN (SELECT tid FROM t);
SELECT * FROM t_x_codes WHERE cid NOT IN (SELECT cid FROM codes);
By addressing normalization errors, replacing row-level triggers with batch processing, and implementing robust CSV parsing, this approach ensures scalability, data integrity, and maintainability. The final schema eliminates redundancy while preserving the ability to reconstruct original CSV data through views or application-layer aggregation.