STORED Keyword Ambiguity in SQLite Generated Columns

Issue Overview: STORED Omission from SQLite Keywords Documentation

The core issue revolves around the absence of the term ‘STORED’ in SQLite’s official list of keywords, despite its usage in the syntax for generated columns. Generated columns in SQLite allow developers to define columns whose values are computed from expressions. These columns can be declared as either ‘VIRTUAL’ (computed on-the-fly during reads) or ‘STORED’ (persisted to disk during writes). While ‘VIRTUAL’ appears in the keywords documentation, ‘STORED’ does not. This omission creates ambiguity for users, tooling developers, and documentation maintainers.

Technical Context of Generated Columns

SQLite introduced generated columns in version 3.31.0 (2020-01-22). The syntax for defining them includes an optional STORED or VIRTUAL clause after the column’s expression. For example:

CREATE TABLE t1 (
  a INTEGER PRIMARY KEY,
  b INTEGER,
  c INTEGER GENERATED ALWAYS AS (a * b) STORED
);

Here, STORED specifies that the value is computed once during insertion and stored persistently. The parser accepts STORED or VIRTUAL as part of the column definition, but these terms are not treated as keywords in the lexical analyzer (tokenizer). Instead, they are parsed as identifiers and validated during semantic analysis. This design choice avoids expanding the keyword list but introduces inconsistencies in documentation and tooling.

Documentation Discrepancy

The lang_keywords.html page lists all SQLite keywords, categorizing them as reserved or unreserved. Reserved keywords cannot be used as identifiers without escaping, while unreserved keywords have context-dependent restrictions. The absence of ‘STORED’ here implies it is not a keyword, contradicting its syntactic role in generated columns. This discrepancy confuses users who expect ‘STORED’ to follow keyword rules and tooling developers who rely on accurate keyword lists for syntax highlighting, linting, or code generation.

Impact on Tooling and Development

External tools like linters, ORMs, and IDEs depend on SQLite’s keyword documentation to implement accurate parsing logic. When ‘STORED’ is omitted, these tools may fail to recognize it as a reserved syntactic element, leading to false positives in syntax validation or incorrect autocompletion. For instance, a linter might flag STORED as an invalid identifier in a generated column definition, despite it being valid syntax. This creates friction for developers who must reference forum threads or source code to resolve tooling issues.

Possible Causes: Lexical Parsing vs. Semantic Validation

Tokenizer-Level Keyword Handling

SQLite’s parser uses a two-phase process: lexical analysis (tokenization) and syntactic parsing. The tokenizer maps input text to tokens using a predefined keyword list. Keywords like SELECT or CREATE are recognized during this phase and assigned specific token codes. However, ‘STORED’ is not part of this list. The tokenizer treats it as an identifier (ID token), passing it to the syntactic parser without special handling. This explains its absence from the keywords documentation, as the documentation reflects the tokenizer’s keyword list.

Grammar Rules for Generated Columns

The syntactic parser uses grammar rules to interpret token sequences. For generated columns, the grammar allows an optional identifier after the expression:

generated ::= LP expr(E) RP ID(TYPE). {sqlite3AddGenerated(pParse,E,&TYPE);}

Here, ID(TYPE) captures the optional ‘VIRTUAL’ or ‘STORED’ clause. The parser treats these identifiers as ordinary strings and forwards them to the sqlite3AddGenerated() function. This function performs semantic validation, checking if the identifier is ‘VIRTUAL’ or ‘STORED’ and raising an error otherwise. Thus, ‘STORED’ is enforced during semantic analysis rather than lexical tokenization.

Documentation Policy for Contextual Keywords

SQLite’s documentation policy for keywords is based on whether a term is recognized by the tokenizer. Terms like VIRTUAL (used in CREATE VIRTUAL TABLE) are listed because they are tokenized as keywords. However, ‘STORED’ is validated post-tokenization, creating a gray area. The documentation does not account for context-specific identifiers that act like keywords in certain syntax positions. This policy leads to underdocumentation of terms that are semantically required but lexically unrestricted.

Historical Precedents and Comparisons

Other SQLite features exhibit similar behavior. For example, AUTOINCREMENT is a keyword in CREATE TABLE but not listed as reserved. However, its usage is enforced during semantic checks. This inconsistency suggests a broader pattern where syntactic elements are validated post-parsing, creating documentation gaps. The lack of a formal category for "contextual keywords" exacerbates confusion, as users expect all syntax-enforced terms to appear in the keyword list.

Troubleshooting Steps, Solutions & Fixes

For Users: Workarounds and Best Practices

Treating STORED as a De Facto Keyword

Even though ‘STORED’ is not officially a keyword, developers should treat it as unreserved in generated column definitions. Avoid using ‘STORED’ as an identifier (e.g., column or table names) to prevent conflicts. If escaping is necessary, use double-quotes or square brackets:

CREATE TABLE t2 (
  "STORED" INTEGER,  -- Legal but discouraged
  b INTEGER GENERATED ALWAYS AS (a * 2) STORED
);

Documentation Cross-Referencing

When encountering tooling errors related to ‘STORED’, consult the Generated Columns documentation instead of the keyword list. This page explicitly mentions ‘STORED’ as a valid clause, providing authoritative guidance despite the keyword list omission.

For Tooling Developers: Adjusting Parsers and Linters

Augmenting Keyword Lists

Tools should include ‘STORED’ in their SQLite keyword lists for syntax highlighting and validation, even if it is absent from the official documentation. Implement context-aware parsing where ‘STORED’ is recognized only in generated column definitions. For example:

# Pseudocode for a context-aware linter
def validate_generated_column(clause):
    if clause.expression and clause.storage_type:
        if clause.storage_type.value.lower() not in ('stored', 'virtual'):
            raise SyntaxError(f"Invalid storage type: {clause.storage_type}")

Leveraging SQLite’s Grammar Files

SQLite’s parse.y grammar file (part of the source code) defines the syntactic rules for generated columns. Tooling can reference this file to extract context-specific keywords:

// Excerpt from parse.y
ccons ::= GENERATED ALWAYS AS generated.
generated ::= LP expr(E) RP ID(TYPE). {sqlite3AddGenerated(pParse,E,&TYPE);}

By analyzing the grammar, tools can identify positions where ‘STORED’ is expected and adjust their logic accordingly.

For Documentation Maintainers: Clarifying Keyword Status

Proposing Documentation Updates

The SQLite documentation should classify ‘STORED’ as a contextual keyword with a note explaining its role in generated columns. A pull request or patch could modify lang_keywords.html as follows:

<h3>Contextual Keywords</h3>
<p>The following terms act as keywords in specific SQL constructs but are not reserved:
  <ul>
    <li>STORED (used in generated column definitions)</li>
    <li>AUTOINCREMENT (used in CREATE TABLE)</li>
  </ul>
</p>

Differentiating Reserved and Unreserved Keywords

Introduce visual markers in the keyword list to distinguish reserved and unreserved keywords. For example, an asterisk (*) could denote reserved keywords, while contextual keywords appear in a separate table. This aligns with practices in other SQL dialects (e.g., PostgreSQL’s Key Words appendix).

For SQLite Developers: Addressing the Ambiguity

Tokenizer Modification

Adding ‘STORED’ to the tokenizer’s keyword list would resolve the documentation discrepancy. This requires modifying the mkkeywordhash tool to include ‘STORED’ and regenerating the keyword hash table. However, this approach might inflate the keyword list for a term used in only one context.

Grammar Rule Refinement

Alternatively, the parser could treat ‘STORED’ as a keyword exclusively in generated column contexts. This involves creating a new token type (e.g., TK_STORED) and updating the grammar rules to expect this token:

generated ::= LP expr(E) RP STORED. {sqlite3AddGenerated(pParse,E,STORED);}

This targeted approach minimizes impact on the overall keyword list while ensuring ‘STORED’ appears in documentation.

Long-Term Considerations

Formalizing Contextual Keywords

SQLite could introduce a formal category for contextual keywords, documenting terms that are enforced semantically rather than lexically. This would improve clarity for users and tooling developers, reducing ambiguity in future syntax extensions.

Community-Driven Documentation Annotations

Allow community contributions to document edge cases and contextual keywords via footnotes or annotations. This decentralized approach would help keep the documentation accurate without overburdening core maintainers.

Conclusion

The absence of ‘STORED’ from SQLite’s keyword list stems from its validation during semantic analysis rather than lexical tokenization. While this design avoids keyword list inflation, it creates documentation gaps affecting users and tooling. Short-term workarounds include treating ‘STORED’ as an unreserved keyword and updating tooling logic. Long-term solutions involve documentation clarifications or parser modifications to formalize ‘STORED’ as a contextual keyword. By addressing these layers, the SQLite ecosystem can reduce ambiguity and improve developer experience.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *