Resolving CSV-to-Relational Mapping and Normalization Errors in SQLite Many-to-Many Designs


Structural Flaws in CSV-to-Relational Data Parsing and Normalization

Issue Overview: Faulty Schema Design and Trigger-Based CSV Parsing

The core challenge involves converting comma-separated values (CSV) stored in a single column into a normalized relational structure using a many-to-many relationship. The original implementation uses a trigger to split CSV strings into separate rows via SQLite’s json_each function, populating a junction table (t_x_codes) and a lookup table (codes). While functional for small datasets, this approach introduces critical flaws:

  1. Normalization Violations:

    • The t table’s codes column violates First Normal Form (1NF) by storing multiple values in a single field. This creates data redundancy and complicates atomicity guarantees.
    • The purpose of the codes table is ambiguous: if codes are context-dependent (e.g., "dog" meaning different things for different t entries), storing them centrally is semantically incorrect; if codes are universal, the junction table (t_x_codes) redundantly duplicates the code text instead of referencing surrogate keys.
  2. Data Integrity Risks:

    • The trigger uses INSERT OR REPLACE on the codes table. In SQLite, OR REPLACE resolves a conflict by deleting the existing row and inserting a new one, so existing entries can be silently replaced when case or contextual differences exist (e.g., "Foo" vs. "foo"), and any ON DELETE actions fire against the old row.
    • No foreign key constraints enforce referential integrity between t_x_codes and codes, risking orphaned records.
  3. Performance and Scalability:

    • The AFTER INSERT trigger processes each new t row individually. For bulk inserts (e.g., "100s of 1000s of rows"), this row-by-row processing becomes inefficient.
    • The JSON-based CSV parsing method (json_each('["' || Replace(...) || '"]')) is fragile: the naive string replacement treats every comma as a delimiter and produces malformed JSON whenever a value contains double quotes, backslashes, or commas that were escaped in the source CSV.
  4. Schema Redundancy:

    • The t table retains the original CSV string in the codes column, creating a synchronization burden. If codes are modified via the codes or t_x_codes tables, the t.codes field becomes outdated.
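
For reference, a minimal reconstruction of the trigger-based approach described above (table and column names follow the discussion; the exact expressions are assumptions):

-- Sketch of the original design: fires once per inserted row
CREATE TRIGGER t_ai AFTER INSERT ON t
WHEN NEW.codes IS NOT NULL AND NEW.codes != ''
BEGIN
  -- Naive CSV-to-JSON conversion: breaks if a value contains
  -- double quotes or other JSON-significant characters
  INSERT OR REPLACE INTO codes (code)
  SELECT value
  FROM json_each('["' || REPLACE(NEW.codes, ',', '","') || '"]');

  -- Junction rows store the code text directly (no surrogate key)
  INSERT INTO t_x_codes (tid, code)
  SELECT NEW.tid, value
  FROM json_each('["' || REPLACE(NEW.codes, ',', '","') || '"]');
END;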

Root Causes: Misapplied Normalization and Fragile Parsing Logic

  1. Misunderstanding Many-to-Many Relationships:

    • The junction table (t_x_codes) directly stores code values instead of foreign keys to a normalized codes table. This conflates two distinct concepts:
      • Business Entity: A code’s semantic meaning (stored in codes).
      • Relationship: The association between a t entry and a code (stored in t_x_codes).
    • If codes are immutable and globally unique, t_x_codes should reference codes.code via a foreign key. If codes are context-specific to t entries, the codes table is unnecessary.
  2. Improper Use of Triggers for Data Transformation:

    • Triggers are executed synchronously during row insertion. Parsing CSV strings within a trigger forces all downstream logic (code insertion, junction population) into a transactional context, increasing lock contention and slowing bulk operations.
    • The JSON-based CSV splitting assumes ideal input formats. Real-world data often includes edge cases (e.g., "value, with \"internal\" quotes") that break naive string replacement.
  3. Lack of Data Validation and Constraints:

    • Without a UNIQUE constraint on t.title, duplicate titles can coexist under different tid values, complicating data retrieval.
    • Case sensitivity in code comparisons (e.g., "Foo" vs. "foo") is unhandled, leading to duplicate entries in codes unless COLLATE NOCASE is applied.
  4. Redundant Storage of Original CSV Data:

    • Retaining the raw CSV string in t.codes violates normalization principles. Derived data should be computable from the normalized tables (e.g., via a VIEW).
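
To make the parsing fragility concrete, here is a minimal demonstration (the literal inputs are illustrative):

-- Clean input: the constructed string is valid JSON
SELECT value
FROM json_each('["' || REPLACE('foo,bar,baz', ',', '","') || '"]');
-- -> foo, bar, baz

-- A value with an embedded double quote produces malformed JSON,
-- so json_each raises an error instead of returning rows
SELECT value
FROM json_each('["' || REPLACE('say "hi",bye', ',', '","') || '"]');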

Solutions: Schema Refactoring, Robust Parsing, and Batch Optimization

Step 1: Normalize the Schema Correctly
Option A: Universal Code Definitions
If codes have universal meanings:

-- Parent table first; the original CSV column is gone, and the CSV
-- is derived via a VIEW instead
CREATE TABLE t (
  tid INTEGER PRIMARY KEY,
  title TEXT UNIQUE NOT NULL
);

-- Use surrogate keys for codes to minimize redundancy
CREATE TABLE codes (
  cid INTEGER PRIMARY KEY,
  code TEXT UNIQUE NOT NULL COLLATE NOCASE
);

-- Junction table references surrogate keys
CREATE TABLE t_x_codes (
  tid INTEGER NOT NULL REFERENCES t(tid),
  cid INTEGER NOT NULL REFERENCES codes(cid),
  PRIMARY KEY (tid, cid)
) WITHOUT ROWID;

-- Reconstruct the original CSV via aggregation; the LEFT JOINs keep
-- rows whose code list is empty (their codes column comes back NULL)
CREATE VIEW t_with_codes AS
SELECT t.tid, t.title, GROUP_CONCAT(c.code, ',') AS codes
FROM t
LEFT JOIN t_x_codes tc ON t.tid = tc.tid
LEFT JOIN codes c ON tc.cid = c.cid
GROUP BY t.tid;
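
A quick usage check against the view (element order inside the concatenated string is not guaranteed; SQLite 3.44+ allows an ORDER BY inside GROUP_CONCAT if stable output is needed):

-- Rows with no codes still appear, with codes = NULL, via the LEFT JOINs
SELECT tid, title, codes FROM t_with_codes WHERE title = 'first t';
-- e.g. 1 | first t | foo,bar,baz,qux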

Option B: Context-Specific Codes
If codes are only meaningful within their parent t entry:

-- Eliminate codes table; store directly in junction
CREATE TABLE t_x_codes (
  tid INTEGER NOT NULL REFERENCES t(tid),
  code TEXT NOT NULL,
  PRIMARY KEY (tid, code)
) WITHOUT ROWID;
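
A brief illustration, with made-up values, of how identical code text can coexist under different parents without implying shared meaning:

-- "dog" for tid 1 and "dog" for tid 2 are independent facts here
INSERT INTO t_x_codes (tid, code) VALUES
  (1, 'dog'),
  (2, 'dog');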

Step 2: Replace Trigger with Batch CSV Parsing
Avoid row-level triggers for bulk operations. Instead, parse CSVs during initial insertion using a temporary table:

-- Temporary staging table for raw data
CREATE TEMP TABLE t_staging (
  title TEXT,
  codes TEXT
);

-- Bulk insert raw data
INSERT INTO t_staging (title, codes)
VALUES 
  ('first t', 'foo,bar,baz,qux'),
  ('second t', 'one,two,thr'),
  ('third t','red,grn,blu,blk,cyn'),
  ('fourth t', 'cat,dog,cow'),
  ('fifth t null code', NULL),
  ('sixth t blank code', '');

-- Insert into t, ignoring codes
INSERT INTO t (title)
SELECT title FROM t_staging;

-- Parse codes in bulk using a recursive CTE, materialized into a
-- temp table: a CTE is visible only to the single statement it
-- prefixes, and the rows are needed by both inserts below
CREATE TEMP TABLE split_codes AS
WITH RECURSIVE split(tid, code, rest) AS (
  SELECT t.tid,
         SUBSTR(ts.codes, 1, INSTR(ts.codes || ',', ',') - 1),
         SUBSTR(ts.codes, INSTR(ts.codes || ',', ',') + 1)
  FROM t
  JOIN t_staging ts ON t.title = ts.title
  WHERE ts.codes IS NOT NULL AND ts.codes != ''
  UNION ALL
  SELECT tid,
         SUBSTR(rest, 1, INSTR(rest || ',', ',') - 1),
         SUBSTR(rest, INSTR(rest || ',', ',') + 1)
  FROM split
  WHERE rest != ''
)
-- DISTINCT guards against a code repeated within one CSV string,
-- which would otherwise violate the junction table's primary key
SELECT DISTINCT tid, code FROM split WHERE code != '';

INSERT OR IGNORE INTO codes (code)
SELECT DISTINCT code FROM split_codes;

-- Populate junction table (COLLATE NOCASE so 'Foo' matches 'foo')
INSERT OR IGNORE INTO t_x_codes (tid, cid)
SELECT s.tid, c.cid
FROM split_codes s
JOIN codes c ON s.code = c.code COLLATE NOCASE;
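
If Option B from Step 1 was chosen instead, the codes lookup disappears and the parsed rows go straight into the junction table (split_codes is the temp table created above; temp tables persist until the connection closes):

-- Option B variant: store parsed values directly in the junction table
INSERT OR IGNORE INTO t_x_codes (tid, code)
SELECT tid, code FROM split_codes;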

Step 3: Handle Edge Cases in CSV Parsing
Use a splitting mechanism that tolerates double-quoted fields containing embedded commas:

-- Quote-aware splitter, materialized into a temp table for reuse
CREATE TEMPORARY TABLE split_temp AS
WITH RECURSIVE splitter(tid, code, rest, depth) AS (
  -- Seed: one row per t entry; the trailing comma simplifies termination
  SELECT t.tid,
         '',
         ts.codes || ',',
         0
  FROM t
  JOIN t_staging ts ON t.title = ts.title
  WHERE ts.codes IS NOT NULL AND ts.codes != ''
  UNION ALL
  SELECT tid,
         CASE
           -- Quoted field: take everything up to the closing quote
           WHEN SUBSTR(rest, 1, 1) = '"' THEN
             SUBSTR(rest, 2, INSTR(SUBSTR(rest, 2), '"') - 1)
           -- Unquoted field: take everything up to the next comma
           ELSE
             SUBSTR(rest, 1, INSTR(rest || ',', ',') - 1)
         END,
         CASE
           -- Quoted field: skip past the closing quote and its comma
           WHEN SUBSTR(rest, 1, 1) = '"' THEN
             SUBSTR(rest, INSTR(SUBSTR(rest, 2), '"') + 3)
           ELSE
             SUBSTR(rest, INSTR(rest || ',', ',') + 1)
         END,
         depth + 1
  FROM splitter
  WHERE rest != '' AND depth < 100 -- Prevent runaway recursion on malformed input
)
SELECT tid, TRIM(code) AS code -- Strip surrounding whitespace
FROM splitter
WHERE code != '' AND depth > 0; -- Drop the empty seed rows
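
The split_temp table is then consumed exactly as in Step 2; for example, under Option A:

-- Load distinct codes, then link them to their parent rows
INSERT OR IGNORE INTO codes (code)
SELECT DISTINCT code FROM split_temp;

INSERT OR IGNORE INTO t_x_codes (tid, cid)
SELECT s.tid, c.cid
FROM split_temp s
JOIN codes c ON s.code = c.code COLLATE NOCASE; -- match case-insensitively

Note that this splitter still assumes well-formed quoting: doubled-quote escapes ("") are not unescaped, and a closing quote must be followed immediately by a comma or the end of the string. Genuinely messy CSV is better parsed in the application layer before it reaches SQLite.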

Step 4: Enforce Data Consistency

  • Add foreign key constraints and enable enforcement, which is off by default and must be set on every connection:
    PRAGMA foreign_keys = ON; -- Enable FK enforcement
    
  • Apply COLLATE NOCASE to code columns for case insensitivity.
  • Remove the redundant t.codes column; reconstruct the CSV through the t_with_codes VIEW instead.
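
With enforcement enabled, an orphaned junction row is rejected at insert time (tid 999 is assumed not to exist in t):

PRAGMA foreign_keys = ON;

-- Fails with "FOREIGN KEY constraint failed"
INSERT INTO t_x_codes (tid, cid) VALUES (999, 1);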

Step 5: Optimize Bulk Insert Performance

  • Disable foreign key checks entirely during bulk inserts, or defer them until COMMIT (defer_foreign_keys only has an effect while foreign_keys is ON):
    PRAGMA foreign_keys = OFF;       -- skip FK checks entirely, or:
    PRAGMA defer_foreign_keys = ON;  -- with FKs on, check once at COMMIT
    -- Perform bulk inserts
    PRAGMA foreign_keys = ON;
    
  • Use batched transactions:
    BEGIN TRANSACTION;
    -- Bulk insert logic
    COMMIT;
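
Putting these together, a typical bulk load looks like the sketch below; PRAGMA foreign_keys cannot be toggled inside an open transaction, so it is set before BEGIN and restored after COMMIT:

PRAGMA foreign_keys = OFF;
BEGIN TRANSACTION;
-- stage raw rows, insert into t, split codes, populate junction (Steps 2-3)
COMMIT;
PRAGMA foreign_keys = ON;
PRAGMA foreign_key_check; -- reports any rows violating FKs after the load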
    

Step 6: Validate and Test

  • Verify normalization:
    -- Ensure no duplicate codes
    SELECT code, COUNT(*) FROM codes GROUP BY code HAVING COUNT(*) > 1;
    
  • Check orphaned records:
    -- Orphaned t_x_codes entries
    SELECT * FROM t_x_codes WHERE tid NOT IN (SELECT tid FROM t);
    SELECT * FROM t_x_codes WHERE cid NOT IN (SELECT cid FROM codes);
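
As a final sanity check, per-row junction counts can be compared against the staged input (a rough check; exact string comparison would require normalizing element order):

-- Each t row's linked-code count; compare against the staged CSV
SELECT t.title, COUNT(tc.cid) AS linked_codes
FROM t
LEFT JOIN t_x_codes tc ON tc.tid = t.tid
GROUP BY t.tid;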
    

By addressing normalization errors, replacing row-level triggers with batch processing, and implementing robust CSV parsing, this approach ensures scalability, data integrity, and maintainability. The final schema eliminates redundancy while preserving the ability to reconstruct original CSV data through views or application-layer aggregation.
