JSON Grammar Specification

A comprehensive specification for defining PEG (Parsing Expression Grammar) parsers using JSON syntax. This specification provides a portable way of specifying grammars that can be easily implemented in different parsing libraries and programming languages.

Introduction
Rationale and Portability
Grammar File Structure
Grammar Node Types
AST Conversion
Complete Examples
Best Practices

Introduction

The JSON Grammar specification defines how to create parsing expression grammars using JSON syntax. This specification enables language-agnostic grammar definitions that can be serialized, shared, and processed across different platforms and programming languages.

A major advantage of JSON Grammar is that it can specify CST to AST transformation fully in JSON, which is "super" portable and does not require any code injections in the programming language where the parser is generated.

JSON-based grammars leverage the power of PEG (Parsing Expression Grammar) parsing, which provides top-down recursive descent backtracking capabilities. The grammars can be compiled into efficient parsing functions at runtime, generating both Concrete Syntax Trees (CST) and Abstract Syntax Trees (AST) from textual input.

Rationale and Portability

Why JSON-based Grammar?

Language Agnostic: JSON is universally supported across programming languages, making grammars portable between different implementations.
Serialization: Grammar definitions can be easily stored, transmitted, and version-controlled as standard JSON files.
Tooling Support: Leverage existing JSON validation, editing, and processing tools for grammar development.
Interoperability: Enable grammar sharing between different parser generators and language ecosystems.
Runtime Configuration: Dynamically load and modify grammar definitions without code changes.

Grammar File Structure

A JSON grammar file defines a complete parsing grammar with the following top-level structure:

interface Grammar {
  start: string;                             // Entry point rule name
  cst: Record<string, GrammarNode>;          // Concrete syntax rules
  ast?: Record<string, AstNodeExpression>;   // AST transformation rules
}

Basic Example

{
  "start": "Value",
  "cst": {
    "Value": {"r": "Number"},
    "Number": {"t": "/\\d+/"}
  },
  "ast": {
    "Number": ["num", ["$", "/raw"]]
  }
}

Structure Components

Start Symbol

start: String specifying the root grammar rule name
Must reference a rule defined in the cst section
Determines the entry point for parsing

CST Rules

cst: Object mapping rule names to grammar node definitions
Contains all named grammar rules that can be referenced
Supports all five grammar node types (Reference, Terminal, Production, Union, List)

AST Transformations

ast: Optional object defining custom AST generation rules
Maps rule names to JSON expressions for AST transformation
Overrides default AST generation behavior

Grammar Node Types

The JSON grammar specification supports five fundamental node types for defining parsing rules. Each type has a specific JSON representation with both full interface and shorthand syntax options.

1. RefNode (Reference Node)

References a named rule defined elsewhere in the grammar.

Interface

type RefNode<Name extends string = string> = {
  r: Name;
};

Syntax

{"r": "RuleName"}

Examples

{
  "start": "Program",
  "cst": {
    "Program": {"r": "Statement"},
    "Statement": "return;"
  }
}

// Matches:
// return;

2. TerminalNode (Terminal Node)

Matches literal strings, regular expressions, or arrays of alternatives. Terminal nodes are leaf nodes in the parse tree.

Interface

interface TerminalNode {
  type?: string;                           // Type name (default: "Text")
  t: RegExp | string | '' | string[];      // Pattern(s) to match
  repeat?: '*' | '+';                      // Repetition (only for string arrays)
  sample?: string;                         // Sample text for generation
  ast?: AstNodeExpression;                 // AST transformation
}

// Shorthand: string, RegExp, or empty string
type TerminalNodeShorthand = RegExp | string | '';

Syntax Options

String Literal:

"hello"  // Matches exactly: hello

Regular Expression:

{"t": "/[a-z]+/"}  // Matches: abc, hello, test

Note: Regular expressions in JSON must be represented as objects with a t property containing the regex pattern as a string.

Array of Alternatives:

{"t": ["true", "false"]}  // Matches: true OR false

With Repetition:

{"t": [" ", "\t", "\n"], "repeat": "*"}  // Matches: any whitespace sequence

Full Terminal Node:

{
  "t": "/\\d+/",           // Matches: 123, 456, 7890
  "type": "Number",
  "sample": "123",
  "ast": ["num", ["$", "/raw"]]
}

Important Notes

Regular Expression Syntax:

In TypeScript/JavaScript, regex can be written as /pattern/flags
In JSON, regex must be a string: {"t": "/pattern/flags"}
Escape characters must be double-escaped in JSON strings: "\\d+" instead of \d+

Repetition Patterns:

repeat: "*" means zero or more matches (equivalent to regex *)
repeat: "+" means one or more matches (equivalent to regex +)
Only applicable when t is an array of strings

Examples

{
  "cst": {
    "Null": "null",                          // Matches: null
    "Number": "/\\-?\\d+(\\.\\d+)?/",       // Matches: 123, -45.67, 0.5
    "Boolean": {"t": ["true", "false"]},     // Matches: true OR false
    "Whitespace": {"t": [" ", "\t", "\n"], "repeat": "*"},  // Matches: any whitespace
    "Identifier": {
      "t": "/[a-zA-Z_][a-zA-Z0-9_]*/",      // Matches: varName, _temp, MY_CONST
      "type": "Identifier",
      "sample": "variable_name"
    }
  }
}

3. ProductionNode (Production Node)

Matches a sequence of grammar nodes in order. All nodes in the sequence must match for the production to succeed.

Interface

interface ProductionNode {
  p: GrammarNode[];                    // Sequence of nodes to match
  type?: string;                       // Type name (default: "Production")
  children?: Record<number, string>;   // Child index to property mapping
  ast?: AstNodeExpression;             // AST transformation
}

// Shorthand: array of grammar nodes
type ProductionNodeShorthand = GrammarNode[];

Syntax Options

Shorthand Array:

["{", {"r": "Content"}, "}"]  // Matches: { content }

Example

{
  "cst": {
    "FunctionCall": ["identifier", "(", {"r": "Arguments"}, ")"],  // Matches: func(args)
    "Assignment": [{"r": "Variable"}, "=", {"r": "Expression"}]     // Matches: x = value
  }
}

Full Production Node:

{
  "p": ["{", {"r": "Content"}, "}"],  // Matches: { content }
  "type": "Block",
  "children": {
    "1": "content"
  }
}

More Examples

{
  "cst": {
    "FunctionCall": ["func", "(", ")"],                              // Matches: func()
    "Assignment": {
      "p": [{"r": "Identifier"}, "=", {"r": "Expression"}],        // Matches: x = 5
      "type": "Assignment",
      "children": {
        "0": "target",
        "2": "value"
      }
    },
    "IfStatement": {
      "p": ["if", "(", {"r": "Expression"}, ")", {"r": "Statement"}],  // Matches: if (x) stmt
      "children": {
        "2": "condition",
        "4": "body"
      }
    }
  }
}

4. UnionNode (Union Node)

Matches one of several alternative patterns. The first matching alternative is selected (ordered choice).

Interface

interface UnionNode {
  u: GrammarNode[];           // Array of alternative nodes
  type?: string;              // Type name (default: "Union")
  ast?: AstNodeExpression;    // AST transformation
}

Syntax

{
  "u": ["pattern1", "pattern2", "pattern3"]  // Matches: pattern1 OR pattern2 OR pattern3
}

Examples

{
  "cst": {
    "Literal": {
      "u": ["null", "true", "false", {"r": "Number"}, {"r": "String"}]  // Matches any literal type
    },
    "Statement": {
      "u": [                                                              // Matches any statement type
        {"r": "IfStatement"},
        {"r": "ReturnStatement"},
        {"r": "ExpressionStatement"}
      ]
    },
    "BinaryOperator": {
      "u": ["+", "-", "*", "/", "==", "!=", "<", ">"]                   // Matches any operator
    }
  }
}

5. ListNode (List Node)

Matches zero or more repetitions of a pattern.

Interface

interface ListNode {
  l: GrammarNode;              // Node to repeat
  type?: string;               // Type name (default: "List")
  ast?: AstNodeExpression;     // AST transformation
}

Syntax

{
  "l": "pattern"  // Matches: zero or more occurrences of pattern
}

Examples

{
  "cst": {
    "Statements": {
      "l": {"r": "Statement"}                                           // Matches: multiple statements
    },
    "Parameters": {
      "l": {                                                            // Matches: param1, param2, param3
        "p": [",", {"r": "Parameter"}],
        "ast": ["$", "/children/1"]
      }
    },
    "Digits": {
      "l": "/[0-9]/",                                                   // Matches: 123456789
      "type": "DigitSequence"
    }
  }
}

AST Conversion

The Abstract Syntax Tree (AST) conversion process transforms the detailed Concrete Syntax Tree (CST) into a simplified, semantically meaningful tree structure suitable for further processing. A major advantage is that all transformations are specified purely in JSON, making them completely portable without requiring any language-specific code injections.

Default AST Generation

When no custom AST transformation is specified, the library generates a canonical AST node:

interface CanonicalAstNode {
  type: string;                                    // Node type
  pos: number;                                     // Start position
  end: number;                                     // End position
  raw?: string;                                    // Raw matched text (the actual text that was matched)
  children?: (CanonicalAstNode | unknown)[];       // Child nodes
}

The raw property contains the actual text that was matched by the grammar node, providing access to the original input text for that specific node.

Simplified AST Construction with `children`

The simplest way to customize AST generation is using the children property mapping in grammar nodes. This allows you to specify which child nodes should be included in the AST and assign them meaningful property names:

{
  "cst": {
    "Assignment": {
      "p": [{"r": "Variable"}, "=", {"r": "Expression"}],
      "children": {
        "0": "target",    // First child becomes "target" property
        "2": "value"      // Third child becomes "value" property
      }
    }
  }
}

AST Expression System

AST transformations use JSON Expression syntax for powerful, declarative tree transformations. This system is completely portable since it's all JSON - no language-specific transformations or code injections are needed.

JSON Expression Transformation Process

The transformation happens as follows:

Default AST Creation: First, a default (canonical) AST node is created from the CST
JSON Expression Application: Then a JSON Expression is applied to that node, allowing modification or extraction of specific parts
Result Generation: The resulting JSON from the JSON Expression evaluation becomes the final AST node
Bottom-up Processing: The process happens bottom-up, with resulting AST nodes supplied to parent node JSON Expression transformations as part of their children array
Children Array: The children array is the canonical way to specify all children in both CST and AST

Basic AST Expressions

Skip Node:

{"ast": null}

Use Child Node:

{"ast": ["$", "/children/0"]}

Custom Object:

{
  "ast": {
    "type": "CustomType",
    "value": ["$", "/raw"]
  }
}

JSON Expression Operations

Value Extraction:

["$", "/raw"]              // Get raw matched text
["$", "/children/0"]       // Get first child's AST
["$", "/pos"]              // Get start position
["$", "/end"]              // Get end position

Type Conversion:

["num", ["$", "/raw"]]     // Convert to number
["bool", ["$", "/raw"]]    // Convert to boolean

// Example transformation:
// Input CST: {type: "Number", raw: "42", pos: 0, end: 2}
// Transform: ["num", ["$", "/raw"]]
// Output AST: 42

String Operations:

["substr", ["$", "/raw"], 1, -1]  // Remove first and last character
["len", ["$", "/raw"]]            // Get string length

// Example transformation:
// Input CST: {type: "String", raw: "\"hello\"", pos: 0, end: 7}
// Transform: ["substr", ["$", "/raw"], 1, -1]
// Output AST: "hello"

Array Operations:

["push", [[]], ["$", "/children/0"]]    // Create array with element
["concat", ["$", "/children/0"], ["$", "/children/1"]]  // Concatenate arrays

Conditional Logic:

["?", ["==", ["$", "/raw"], "true"], true, false]  // Ternary operator

Object Construction:

["o.set", ["$", ""], "key", ["$", "/children/0"]]  // Set object property

Children Mapping

Use the children property to map CST child indices to AST properties:

{
  "p": [{"r": "Key"}, ":", {"r": "Value"}],
  "children": {
    "0": "key",
    "2": "value"
  }
}

This creates an AST node with key and value properties instead of a children array.

Complex AST Examples

Boolean Conversion

{
  "cst": {
    "Boolean": {"t": ["true", "false"]}
  },
  "ast": {
    "Boolean": ["==", ["$", "/raw"], "true"]
  }
}

Number Conversion

{
  "cst": {
    "Number": "/\\d+/"
  },
  "ast": {
    "Number": ["num", ["$", "/raw"]]
  }
}

String Unescaping

{
  "cst": {
    "String": "/\"[^\"]*\"/"
  },
  "ast": {
    "String": ["substr", ["$", "/raw"], 1, -1]
  }
}

Array Collection

{
  "cst": {
    "Array": ["[", {"r": "Elements"}, "]"]
  },
  "ast": {
    "Array": ["$", "/children/1"]
  }
}

List Flattening

{
  "cst": {
    "CommaSeparated": {
      "p": [
        {"r": "Item"},
        {
          "l": {
            "p": [",", {"r": "Item"}],
            "ast": ["$", "/children/1"]
          }
        }
      ],
      "ast": ["concat", ["push", [[]], ["$", "/children/0"]], ["$", "/children/1"]]
    }
  }
}

Complete Examples

Simple Calculator Parser

A basic calculator that handles addition and multiplication with proper precedence:

{
  "start": "Expression",
  "cst": {
    "Expression": {
      "p": [{"r": "Term"}, {"l": {"p": [{"r": "AddOp"}, {"r": "Term"}]}}],
      "ast": ["foldl", ["$", "/children/0"], ["$", "/children/1"]]
    },
    "Term": {
      "p": [{"r": "Factor"}, {"l": {"p": [{"r": "MulOp"}, {"r": "Factor"}]}}],
      "ast": ["foldl", ["$", "/children/0"], ["$", "/children/1"]]
    },
    "Factor": {
      "u": [
        {"r": "Number"},
        {"p": ["(", {"r": "Expression"}, ")"], "ast": ["$", "/children/1"]}
      ]
    },
    "Number": "/\\d+/",
    "AddOp": {"u": ["+", "-"]},
    "MulOp": {"u": ["*", "/"]}
  },
  "ast": {
    "Number": ["num", ["$", "/raw"]],
    "AddOp": ["$", "/raw"],
    "MulOp": ["$", "/raw"]
  }
}

JSON Parser

A complete JSON parser grammar:

{
  "start": "Value",
  "cst": {
    "WOpt": {"t": [" ", "\n", "\t", "\r"], "repeat": "*", "ast": null},
    "Value": [{"r": "WOpt"}, {"r": "TValue"}, {"r": "WOpt"}],
    "TValue": {
      "u": [
        {"r": "Null"},
        {"r": "Boolean"},
        {"r": "String"},
        {"r": "Object"},
        {"r": "Array"},
        {"r": "Number"}
      ]
    },
    "Null": "null",
    "Boolean": {"t": ["true", "false"]},
    "Number": "/\\-?(0|([1-9][0-9]*))(\\.\\d+)?([eE][\\+\\-]?\\d+)?/",
    "String": "/\"[^\"\\\\]*(?:\\\\.|[^\"\\\\]*)*\"/",
    "Array": ["[", {"r": "Elements"}, "]"],
    "Elements": {
      "u": [
        {
          "p": [
            {"r": "Value"},
            {
              "l": {
                "p": [",", {"r": "Value"}],
                "ast": ["$", "/children/1"]
              }
            }
          ],
          "ast": ["concat", ["push", [[]], ["$", "/children/0"]], ["$", "/children/1"]]
        },
        {"r": "WOpt"}
      ]
    },
    "Object": ["{", {"r": "Members"}, "}"],
    "Members": {
      "u": [
        {
          "p": [
            {"r": "Entry"},
            {
              "l": {
                "p": [",", {"r": "Entry"}],
                "ast": ["$", "/children/1"]
              }
            }
          ],
          "ast": ["concat", ["push", [[]], ["$", "/children/0"]], ["$", "/children/1"]]
        },
        {"r": "WOpt"}
      ]
    },
    "Entry": {
      "p": [{"r": "WOpt"}, {"r": "String"}, {"r": "WOpt"}, ":", {"r": "Value"}],
      "children": {
        "1": "key",
        "4": "value"
      }
    }
  },
  "ast": {
    "Value": ["$", "/children/1"],
    "Boolean": ["==", ["$", "/raw"], "true"],
    "Number": ["num", ["$", "/raw"]],
    "String": ["substr", ["$", "/raw"], 1, -1],
    "Array": ["$", "/children/1"],
    "Object": ["$", "/children/1"],
    "Elements": ["?", ["len", ["$", "/children"]], ["$", "/children/0"], [[]]],
    "Members": ["?", ["len", ["$", "/children"]], ["$", "/children/0"], [[]]]
  }
}

Best Practices

Grammar Design

Start Simple: Begin with basic rules and gradually add complexity
Use Meaningful Names: Choose descriptive rule names that reflect their purpose
Leverage Shortcuts: Use shorthand syntax where appropriate for cleaner grammars
Whitespace Handling: Create dedicated whitespace rules with ast: null for clean ASTs
Left Recursion: Avoid left-recursive rules; use right-recursion with lists instead

Performance Optimization

Order Alternatives: Place most common alternatives first in union nodes
Minimize Backtracking: Design grammars to reduce ambiguity
Atomic Groups: Use terminal nodes for performance-critical patterns
Sample Data: Provide sample strings for testing and validation

AST Design

Semantic Focus: Include only semantically meaningful information in ASTs
Consistent Structure: Maintain consistent AST node shapes across similar constructs
Type Safety: Use clear, descriptive type names for AST nodes
Flatten Lists: Transform complex nested structures into simpler forms

Error Handling

Debug Mode: Use debug compilation for grammar development
Incremental Testing: Test grammar rules individually before combining
Trace Analysis: Leverage debug traces to understand parse failures
Sample Validation: Verify grammars against known good and bad inputs

Debugging and Validation

Debug Mode

Debug traces can be captured during grammar development to understand parsing behavior. Debug trace nodes typically form a tree structure that mirrors the grammar execution, allowing developers to:

Trace Execution: Follow the parser's decision-making process through the grammar
Identify Failures: Pinpoint where parsing fails and why certain rules don't match
Performance Analysis: Understand which rules are expensive or cause excessive backtracking
Grammar Validation: Verify that the grammar behaves as expected on test inputs

Debug trace nodes generally contain information about:

Rule name being executed
Input position and matched text
Success or failure status
Child trace nodes for nested rule calls
Backtracking information

Grammar Inspection

Most JSON Grammar implementations provide utilities to:

Print Grammar Structure: Display the grammar in a human-readable format
Visualize Parse Trees: Show the concrete syntax tree structure
Export Grammar Metadata: Generate documentation or schema information
Validate Grammar Rules: Check for common issues like left recursion or unreachable rules

Maintainability

Documentation: Include comments and examples in grammar files
Modularity: Break complex grammars into logical sections
Version Control: Track grammar evolution through version control
Testing: Maintain comprehensive test suites for grammar rules

Common Patterns

Optional Elements

{"u": [{"r": "Element"}, ""]}  // Matches: Element OR nothing

Comma-Separated Lists

{
  "u": [
    {
      "p": [                                     // Matches: item1, item2, item3
        {"r": "Item"},
        {"l": {"p": [",", {"r": "Item"}], "ast": ["$", "/children/1"]}}
      ],
      "ast": ["concat", ["push", [[]], ["$", "/children/0"]], ["$", "/children/1"]]
    },
    ""                                           // OR empty list
  ]
}

Identifier with Keywords

{
  "cst": {
    "Identifier": {
      "u": [                                     // Matches: any identifier except keywords
        {"r": "Keyword"},
        "/[a-zA-Z_][a-zA-Z0-9_]*/"
      ]
    },
    "Keyword": {
      "u": ["if", "else", "while", "for", "return"]  // Matches: reserved keywords
    }
  }
}

Uh oh!

FilesExpand file tree

JSON-Grammar-spec.md

Latest commit

History

JSON-Grammar-spec.md

File metadata and controls

JSON Grammar Specification

Table of Contents

Introduction

Rationale and Portability

Why JSON-based Grammar?

Grammar File Structure

Basic Example

Structure Components

Start Symbol

CST Rules

AST Transformations

Grammar Node Types

1. RefNode (Reference Node)

Interface

Syntax

Examples

2. TerminalNode (Terminal Node)

Interface

Syntax Options

Important Notes

Examples

3. ProductionNode (Production Node)

Interface

Syntax Options

Example

More Examples

4. UnionNode (Union Node)

Interface

Syntax

Examples

5. ListNode (List Node)

Interface

Syntax

Examples

AST Conversion

Default AST Generation

Simplified AST Construction with children

AST Expression System

JSON Expression Transformation Process

Basic AST Expressions

JSON Expression Operations

Children Mapping

Complex AST Examples

Boolean Conversion

Number Conversion

String Unescaping

Array Collection

List Flattening

Complete Examples

Simple Calculator Parser

JSON Parser

Best Practices

Grammar Design

Performance Optimization

AST Design

Error Handling

Debugging and Validation

Debug Mode

Grammar Inspection

Maintainability

Common Patterns

Optional Elements

Comma-Separated Lists

Identifier with Keywords

Simplified AST Construction with `children`