Modeling Hierarchical Manuscript Data for Efficient Grammar Querying in SQLite
Manuscript Segmentation Hierarchy & Grammar Query Challenges
Issue Overview: Hierarchical Word Decomposition in Manuscript Analysis
The core challenge involves structuring a SQLite database to represent decomposed Hebrew manuscript words in a hierarchical manner, enabling efficient querying of grammatical constructs across multiple segmentation levels. The user’s data consists of three nested layers: manuscript entries (e.g., "andinblood-grapes"), their hyphenated splits ("andinblood", "grapes"), and parsed grammatical components ("and", "in", "blood"). This tree structure must support complex queries like finding all instances where a preposition precedes specific noun forms with particular articles, requiring cross-level relationships to be immediately queryable.
Existing schema attempts use a single table with positional indicators (indexManu, indexManuPos, parsePos) to differentiate hierarchy levels. While this denormalized approach simplifies initial data entry through group concatenation operations, it creates ambiguity in parent-child relationships between segmentation levels. For example, the entry for "andinblood" (indexManuPos=1) implicitly belongs to "andinblood-grapes" (indexManuPos=0) under the same indexManu=4, but SQLite has no built-in mechanism to automatically enforce or navigate these relationships without explicit foreign keys.
The grammatical query requirements add another layer of complexity. A search for prepositions preceding masculine singular nouns requires joining grammatical metadata across adjacent parsed components within the same split group. With the current flat structure, this necessitates window functions or self-joins that become computationally intensive as data scales. Moreover, the mixing of different entity types (full manuscript entries, splits, parsed particles) in one table complicates index optimization, as queries may need to filter simultaneously on hierarchy position (parsePos), grammatical class, and word position.
Structural Limitations in Hierarchical Representation and Query Patterns
1. Implicit Hierarchy Through Positional Columns
The current design uses indexManuPos and parsePos to imply nesting through position numbering rather than explicit parent-child relationships. This forces all hierarchy navigation logic into application code or complex CTE queries. For instance, determining that "andinblood" is a child of "andinblood-grapes" requires calculating that indexManuPos=1 falls under indexManuPos=0 for the same indexManu. While SQLite’s WITH RECURSIVE can theoretically traverse this, the lack of explicit parent keys makes recursive queries fragile—any inconsistency in positional numbering breaks the hierarchy.
2. Mixed Entity Types in Single Table
Storing manuscript entries, splits, and parsed particles in the same table leads to sparse columns and indexing inefficiencies. The "grammar" column only applies to parsed particles (parsePos >=1), while "original" contains composite values at higher hierarchy levels. Queries targeting specific grammatical features must always include parsePos >0 in WHERE clauses, preventing effective use of partial indexes. Additionally, updating a parsed particle’s grammar (e.g., changing "in" from preposition to particle) requires row-level updates rather than normalized grammatical reference tables.
3. Lack of Transitive Closure Support
SQLite’s default installation doesn’t include the transitive closure extension, which would allow efficient pathfinding in hierarchical data. Without it, recursive CTEs must recompute ancestor/descendant relationships on every query. For large manuscripts with deep parsing (e.g., 10+ parsed components per split), this results in exponential growth in temporary query steps. The current schema exacerbates this by requiring CTEs to first derive parent-child relationships from positional columns before applying grammatical filters.
4. Composite Key Collisions
Using indexManu + indexManuPos + parsePos as a composite primary key creates collision risks when modifying parsed structures. Inserting a new parsed particle between existing positions (e.g., adding a prefix at parsePos=1) necessitates incrementing all subsequent parsePos values manually—a process prone to errors that could orphan child nodes or create duplicate keys.
Optimized Schema Design and Query Strategies
Step 1: Implement Explicit Parent-Child Relationships
Replace positional columns with explicit foreign keys to establish hierarchy:
CREATE TABLE manuscript_nodes (
node_id INTEGER PRIMARY KEY,
parent_id INTEGER REFERENCES manuscript_nodes(node_id),
hierarchy_level TEXT CHECK(level IN ('manuscript', 'split', 'parsed')),
original_text TEXT NOT NULL,
grammar_class TEXT,
start_position INT, -- Char position in parent
end_position INT
);
CREATE INDEX idx_parent ON manuscript_nodes(parent_id);
CREATE INDEX idx_level ON manuscript_nodes(hierarchy_level);
This structure:
- Uses
parent_id
for direct hierarchy navigation - Enforces valid hierarchy levels with a CHECK constraint
- Stores text positions for reconstruction
- Separates grammatical classification into an optional column
Step 2: Normalize Grammatical Metadata
Create reference tables for grammatical terms to enable joins without redundant data:
CREATE TABLE grammar_classes (
grammar_id INTEGER PRIMARY KEY,
class_name TEXT UNIQUE NOT NULL,
attributes JSON -- Stores gender, number, etc.
);
INSERT INTO grammar_classes (class_name, attributes)
VALUES
('preposition', '{"requires_case": "accusative"}'),
('noun', '{"gender": "masculine", "number": "singular"}');
Modify the main table to reference these:
ALTER TABLE manuscript_nodes ADD COLUMN grammar_id INTEGER REFERENCES grammar_classes(grammar_id);
Step 3: Utilize Closure Tables for Ancestor Tracking
Create a closure table to precompute all ancestor-descendant relationships:
CREATE TABLE manuscript_closure (
ancestor_id INTEGER NOT NULL REFERENCES manuscript_nodes(node_id),
descendant_id INTEGER NOT NULL REFERENCES manuscript_nodes(node_id),
depth INTEGER NOT NULL,
PRIMARY KEY (ancestor_id, descendant_id)
);
Populate it via triggers or batch processing after inserts. This enables instant queries like "find all parsed particles under split X" without recursive CTEs:
SELECT desc.*
FROM manuscript_closure c
JOIN manuscript_nodes desc ON c.descendant_id = desc.node_id
WHERE c.ancestor_id = ?split_node_id
AND desc.hierarchy_level = 'parsed';
Step 4: Optimize Indexing Strategy
Create composite indexes aligned with common query patterns:
-- For traversing children
CREATE INDEX idx_children ON manuscript_nodes(parent_id, hierarchy_level);
-- For grammatical queries across hierarchy
CREATE INDEX idx_grammar_query ON manuscript_nodes(grammar_id, hierarchy_level)
WHERE grammar_id IS NOT NULL;
Step 5: Implement Recursive Query Guards
Prevent infinite loops in recursive CTEs with depth limits:
WITH RECURSIVE word_tree AS (
SELECT node_id, parent_id, original_text, grammar_id, 0 AS depth
FROM manuscript_nodes
WHERE node_id = ?start_node
UNION ALL
SELECT n.node_id, n.parent_id, n.original_text, n.grammar_id, wt.depth + 1
FROM manuscript_nodes n
JOIN word_tree wt ON n.parent_id = wt.node_id
WHERE depth < 5 -- Prevent excessive depth
)
SELECT * FROM word_tree;
Step 6: Leverage SQLite JSON1 Extension
Store grammatical attributes as JSON for flexible querying:
SELECT n.original_text, g.class_name, json_extract(g.attributes, '$.gender')
FROM manuscript_nodes n
JOIN grammar_classes g USING (grammar_id)
WHERE json_extract(g.attributes, '$.gender') = 'masculine'
AND n.hierarchy_level = 'parsed';
Step 7: Transactional Hierarchy Modifications
Use atomic transactions when modifying tree structures to maintain closure table consistency:
BEGIN TRANSACTION;
-- Insert new parsed node
INSERT INTO manuscript_nodes (parent_id, hierarchy_level, original_text, grammar_id)
VALUES (?split_node_id, 'parsed', 'in', ?preposition_id);
-- Update closure table
INSERT INTO manuscript_closure (ancestor_id, descendant_id, depth)
SELECT c.ancestor_id, last_insert_rowid(), c.depth + 1
FROM manuscript_closure c
WHERE c.descendant_id = ?split_node_id
UNION ALL
VALUES (last_insert_rowid(), last_insert_rowid(), 0);
COMMIT;
Step 8: Materialized Views for Common Queries
Precompute frequently accessed hierarchy views using triggers:
CREATE TABLE mv_parsed_prepositions (
node_id INTEGER PRIMARY KEY,
preposition_text TEXT,
parent_split TEXT,
manuscript_root TEXT,
FOREIGN KEY(node_id) REFERENCES manuscript_nodes(node_id)
);
-- Refresh on data change
CREATE TRIGGER trg_mv_prepositions AFTER INSERT ON manuscript_nodes
WHEN NEW.grammar_id = (SELECT grammar_id FROM grammar_classes WHERE class_name = 'preposition')
BEGIN
INSERT INTO mv_parsed_prepositions
SELECT n.node_id, n.original_text, s.original_text, m.original_text
FROM manuscript_nodes n
JOIN manuscript_nodes s ON n.parent_id = s.node_id
JOIN manuscript_nodes m ON s.parent_id = m.node_id
WHERE n.node_id = NEW.node_id;
END;
Step 9: Proximity Search Using Window Functions
Find adjacent parsed particles within splits:
SELECT
curr.original_text AS current_word,
next.original_text AS next_word,
curr.grammar_id AS current_grammar,
next.grammar_id AS next_grammar
FROM manuscript_nodes curr
JOIN manuscript_nodes next ON curr.parent_id = next.parent_id
AND curr.node_id = next.node_id - 1 -- Assumes contiguous insertion order
WHERE curr.hierarchy_level = 'parsed'
AND next.hierarchy_level = 'parsed'
AND curr.grammar_id = (SELECT grammar_id FROM grammar_classes WHERE class_name = 'preposition')
AND json_extract((SELECT attributes FROM grammar_classes WHERE grammar_id = next.grammar_id), '$.gender') = 'masculine';
Step 10: Benchmarking and Index Tuning
Analyze query plans for critical operations:
EXPLAIN QUERY PLAN
SELECT *
FROM manuscript_closure
WHERE ancestor_id IN (
SELECT node_id
FROM manuscript_nodes
WHERE original_text = 'blood'
AND hierarchy_level = 'parsed'
);
Use covering indexes to eliminate table scans:
CREATE INDEX idx_closure_covering ON manuscript_closure(ancestor_id, depth, descendant_id);
Final Schema Validation Checklist
- All parent_id values reference existing node_ids (enabled via FOREIGN KEY)
- Closure table depth matches actual tree depth (verify via recursive COUNT)
- Grammatical constraints are applied only to ‘parsed’ level nodes (CHECK constraint)
- Positional columns (start/end) are consistent with parent text length
- Materialized views refresh atomically with base table changes
This approach balances normalization for grammatical metadata with denormalized hierarchy performance optimizations. By separating structural relationships (closure table) from entity properties (grammar classes), queries can efficiently traverse manuscript structures while maintaining linguistic accuracy. The use of SQLite’s JSON1 extension allows flexible querying of grammatical attributes without schema modifications, crucial for handling the evolving requirements of Hebrew manuscript analysis.