A comprehensive specification for defining PEG (Parsing Expression Grammar) parsers using JSON syntax. This specification provides a portable way of specifying grammars that can be easily implemented in different parsing libraries and programming languages.
- Introduction
- Rationale and Portability
- Grammar File Structure
- Grammar Node Types
- AST Conversion
- Complete Examples
- Best Practices
The JSON Grammar specification defines how to create parsing expression grammars using JSON syntax. This specification enables language-agnostic grammar definitions that can be serialized, shared, and processed across different platforms and programming languages.
A major advantage of JSON Grammar is that the CST-to-AST transformation can be specified entirely in JSON, so a grammar stays portable and needs no code injected in the programming language where the parser is generated.
JSON-based grammars build on PEG (Parsing Expression Grammar) parsing, which is top-down, recursive-descent parsing with backtracking. The grammars can be compiled into efficient parsing functions at runtime, generating both Concrete Syntax Trees (CST) and Abstract Syntax Trees (AST) from textual input.
- Language Agnostic: JSON is universally supported across programming languages, making grammars portable between different implementations.
- Serialization: Grammar definitions can be easily stored, transmitted, and version-controlled as standard JSON files.
- Tooling Support: Leverage existing JSON validation, editing, and processing tools for grammar development.
- Interoperability: Enable grammar sharing between different parser generators and language ecosystems.
- Runtime Configuration: Dynamically load and modify grammar definitions without code changes.
A JSON grammar file defines a complete parsing grammar with the following top-level structure:
interface Grammar {
start: string; // Entry point rule name
cst: Record<string, GrammarNode>; // Concrete syntax rules
ast?: Record<string, AstNodeExpression>; // AST transformation rules
}

{
"start": "Value",
"cst": {
"Value": {"r": "Number"},
"Number": {"t": "/\\d+/"}
},
"ast": {
"Number": ["num", ["$", "/raw"]]
}
}

- start: String specifying the root grammar rule name
  - Must reference a rule defined in the cst section
  - Determines the entry point for parsing
- cst: Object mapping rule names to grammar node definitions
  - Contains all named grammar rules that can be referenced
  - Supports all five grammar node types (Reference, Terminal, Production, Union, List)
- ast: Optional object defining custom AST generation rules
  - Maps rule names to JSON expressions for AST transformation
  - Overrides default AST generation behavior
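The top-level requirements listed above can be checked when a grammar file is loaded. The following is a minimal sketch, not part of any particular library: `loadGrammar` and `isRecord` are hypothetical names, and the checks cover only the three top-level fields.

```typescript
// Hypothetical loader for illustration; real implementations may validate differently.
interface Grammar {
  start: string;
  cst: Record<string, unknown>;
  ast?: Record<string, unknown>;
}

// True for plain JSON objects (not arrays, not null).
function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Parse JSON text and check "start", "cst" and the optional "ast" field.
function loadGrammar(json: string): Grammar {
  const data: unknown = JSON.parse(json);
  if (!isRecord(data)) throw new Error("grammar must be a JSON object");
  const { start, cst, ast } = data;
  if (typeof start !== "string") throw new Error('"start" must be a string');
  if (!isRecord(cst)) throw new Error('"cst" must be an object');
  // The entry point must name a rule that exists in the cst section.
  if (!Object.prototype.hasOwnProperty.call(cst, start)) {
    throw new Error('start rule "' + start + '" is not defined in "cst"');
  }
  const grammar: Grammar = { start, cst };
  if (ast !== undefined) {
    if (!isRecord(ast)) throw new Error('"ast" must be an object when present');
    grammar.ast = ast;
  }
  return grammar;
}
```

Failing early on a missing start rule keeps the error close to the grammar file rather than surfacing later as a parse failure.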
The JSON grammar specification supports five fundamental node types for defining parsing rules. Each type has a specific JSON representation with both full interface and shorthand syntax options.
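Because each node type is identified by a single key (or by its shorthand shape), a node can be classified without any extra tag. The sketch below assumes the keys r, t, p, u and l and the string and array shorthands described in this section; `kindOf` is a hypothetical helper name.

```typescript
// Hypothetical classifier for illustration; not part of any library API.
type NodeKind = "reference" | "terminal" | "production" | "union" | "list";

function kindOf(node: unknown): NodeKind {
  // Shorthand forms: a bare string is a terminal, a bare array is a production.
  if (typeof node === "string") return "terminal";
  if (Array.isArray(node)) return "production";
  if (typeof node === "object" && node !== null) {
    if ("r" in node) return "reference";
    if ("t" in node) return "terminal";
    if ("p" in node) return "production";
    if ("u" in node) return "union";
    if ("l" in node) return "list";
  }
  throw new Error("Not a grammar node: " + JSON.stringify(node));
}
```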
A Reference node refers to a named rule defined elsewhere in the grammar.
type RefNode<Name extends string = string> = {
r: Name;
};

{"r": "RuleName"}

{
"start": "Program",
"cst": {
"Program": {"r": "Statement"},
"Statement": "return;"
}
}
// Matches:
// return;

A Terminal node matches literal strings, regular expressions, or arrays of alternatives. Terminal nodes are leaf nodes in the parse tree.
interface TerminalNode {
type?: string; // Type name (default: "Text")
t: RegExp | string | '' | string[]; // Pattern(s) to match
repeat?: '*' | '+'; // Repetition (only for string arrays)
sample?: string; // Sample text for generation
ast?: AstNodeExpression; // AST transformation
}
// Shorthand: string, RegExp, or empty string
type TerminalNodeShorthand = RegExp | string | '';

String Literal:
"hello" // Matches exactly: hello

Regular Expression:
{"t": "/[a-z]+/"} // Matches: abc, hello, test

Note: Regular expressions in JSON must be represented as objects with a t property containing the regex pattern as a string.

Array of Alternatives:
{"t": ["true", "false"]} // Matches: true OR false

With Repetition:
{"t": [" ", "\t", "\n"], "repeat": "*"} // Matches: any whitespace sequence

Full Terminal Node:
{
"t": "/\\d+/", // Matches: 123, 456, 7890
"type": "Number",
"sample": "123",
"ast": ["num", ["$", "/raw"]]
}

Regular Expression Syntax:
- In TypeScript/JavaScript, regex can be written as /pattern/flags
- In JSON, regex must be a string: {"t": "/pattern/flags"}
- Escape characters must be double-escaped in JSON strings: "\\d+" instead of \d+

Repetition Patterns:
- repeat: "*" means zero or more matches (equivalent to regex *)
- repeat: "+" means one or more matches (equivalent to regex +)
- Only applicable when t is an array of strings
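To make the regex-string and repetition rules above concrete, here is a minimal sketch of how a terminal could be matched at a given input offset. It is an illustration under assumptions, not the library's implementation: `toRegExp` and `matchTerminal` are hypothetical names, and the sketch assumes that any string of the form "/pattern/flags" is meant as a regular expression.

```typescript
// Hypothetical helpers for illustration; the names are not taken from any library.
interface Terminal {
  t: string | string[];
  repeat?: "*" | "+";
}

// Turn a "/pattern/flags" string into a sticky RegExp; plain literals return undefined.
function toRegExp(t: string): RegExp | undefined {
  const m = /^\/(.+)\/([a-z]*)$/.exec(t);
  if (!m) return undefined;
  const [, source = "", flags = ""] = m;
  // The sticky flag anchors the match at lastIndex instead of scanning ahead.
  return new RegExp(source, flags.indexOf("y") >= 0 ? flags : flags + "y");
}

// Return the offset just past the match that starts at `pos`, or -1 if nothing matches.
function matchTerminal(node: Terminal, input: string, pos: number): number {
  const t = node.t;
  if (typeof t === "string") {
    const re = toRegExp(t);
    if (!re) return input.slice(pos, pos + t.length) === t ? pos + t.length : -1;
    re.lastIndex = pos;
    return re.exec(input) ? re.lastIndex : -1;
  }
  // Array of alternatives: the first literal that matches at `at` wins.
  const matchOne = (at: number): number => {
    for (const alt of t) {
      if (alt.length > 0 && input.slice(at, at + alt.length) === alt) return at + alt.length;
    }
    return -1;
  };
  if (!node.repeat) return matchOne(pos);
  // "*" accepts zero matches; "+" requires at least one.
  let end = pos;
  let count = 0;
  for (let next = matchOne(end); next !== -1; next = matchOne(end)) {
    end = next;
    count++;
  }
  return node.repeat === "+" && count === 0 ? -1 : end;
}
```

The sticky flag matters here: without it, a pattern such as /\d+/ would skip ahead to the first digit anywhere in the input instead of failing at the current position, which would break the parser's position tracking.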
{
"cst": {
"Null": "null", // Matches: null
"Number": "/\\-?\\d+(\\.\\d+)?/", // Matches: 123, -45.67, 0.5
"Boolean": {"t": ["true", "false"]}, // Matches: true OR false
"Whitespace": {"t": [" ", "\t", "\n"], "repeat": "*"}, // Matches: any whitespace
"Identifier": {
"t": "/[a-zA-Z_][a-zA-Z0-9_]*/", // Matches: varName, _temp, MY_CONST
"type": "Identifier",
"sample": "variable_name"
}
}
}

A Production node matches a sequence of grammar nodes in order. All nodes in the sequence must match for the production to succeed.
interface ProductionNode {
p: GrammarNode[]; // Sequence of nodes to match
type?: string; // Type name (default: "Production")
children?: Record<number, string>; // Child index to property mapping
ast?: AstNodeExpression; // AST transformation
}
// Shorthand: array of grammar nodes
type ProductionNodeShorthand = GrammarNode[];

Shorthand Array:
["{", {"r": "Content"}, "}"] // Matches: { content }

{
"cst": {
"FunctionCall": [{"r": "Identifier"}, "(", {"r": "Arguments"}, ")"], // Matches: func(args)
"Assignment": [{"r": "Variable"}, "=", {"r": "Expression"}] // Matches: x = value
}
}

Full Production Node:
{
"p": ["{", {"r": "Content"}, "}"], // Matches: { content }
"type": "Block",
"children": {
"1": "content"
}
}

{
"cst": {
"FunctionCall": ["func", "(", ")"], // Matches: func()
"Assignment": {
"p": [{"r": "Identifier"}, "=", {"r": "Expression"}], // Matches: x = 5
"type": "Assignment",
"children": {
"0": "target",
"2": "value"
}
},
"IfStatement": {
"p": ["if", "(", {"r": "Expression"}, ")", {"r": "Statement"}], // Matches: if (x) stmt
"children": {
"2": "condition",
"4": "body"
}
}
}
}

A Union node matches one of several alternative patterns. The first matching alternative is selected (ordered choice).
interface UnionNode {
u: GrammarNode[]; // Array of alternative nodes
type?: string; // Type name (default: "Union")
ast?: AstNodeExpression; // AST transformation
}

{
"u": ["pattern1", "pattern2", "pattern3"] // Matches: pattern1 OR pattern2 OR pattern3
}

{
"cst": {
"Literal": {
"u": ["null", "true", "false", {"r": "Number"}, {"r": "String"}] // Matches any literal type
},
"Statement": {
"u": [ // Matches any statement type
{"r": "IfStatement"},
{"r": "ReturnStatement"},
{"r": "ExpressionStatement"}
]
},
"BinaryOperator": {
"u": ["+", "-", "*", "/", "==", "!=", "<", ">"] // Matches any operator
}
}
}

A List node matches zero or more repetitions of a pattern.
interface ListNode {
l: GrammarNode; // Node to repeat
type?: string; // Type name (default: "List")
ast?: AstNodeExpression; // AST transformation
}

{
"l": "pattern" // Matches: zero or more occurrences of pattern
}

{
"cst": {
"Statements": {
"l": {"r": "Statement"} // Matches: multiple statements
},
"Parameters": {
"l": { // Matches the comma-prefixed tail of a list: , param2, param3
"p": [",", {"r": "Parameter"}],
"ast": ["$", "/children/1"]
}
},
"Digits": {
"l": "/[0-9]/", // Matches: 123456789
"type": "DigitSequence"
}
}
}

The Abstract Syntax Tree (AST) conversion process transforms the detailed Concrete Syntax Tree (CST) into a simplified, semantically meaningful tree structure suitable for further processing. A major advantage is that all transformations are specified purely in JSON, making them completely portable without requiring any language-specific code injections.
When no custom AST transformation is specified, the library generates a canonical AST node:
interface CanonicalAstNode {
type: string; // Node type
pos: number; // Start position
end: number; // End position
raw?: string; // Raw matched text (the actual text that was matched)
children?: (CanonicalAstNode | unknown)[]; // Child nodes
}

The raw property contains the exact input text matched by the grammar node, so the original text remains available for that specific node.
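The relationship between pos, end and raw can be shown with a small sketch. It assumes that end is exclusive, which is consistent with the examples in this document (for instance raw "42" with pos 0 and end 2); `canonicalNode` is a hypothetical helper name.

```typescript
interface CanonicalAstNode {
  type: string;
  pos: number;
  end: number;
  raw?: string;
  children?: unknown[];
}

// Hypothetical helper: build a canonical node for the span [pos, end) of `input`.
function canonicalNode(
  type: string,
  input: string,
  pos: number,
  end: number,
  children?: unknown[],
): CanonicalAstNode {
  // raw is the slice of the input covered by the node.
  const node: CanonicalAstNode = { type, pos, end, raw: input.slice(pos, end) };
  if (children && children.length > 0) node.children = children;
  return node;
}
```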
The simplest way to customize AST generation is using the children property mapping in grammar nodes. This allows you to specify which child nodes should be included in the AST and assign them meaningful property names:
{
"cst": {
"Assignment": {
"p": [{"r": "Variable"}, "=", {"r": "Expression"}],
"children": {
"0": "target", // First child becomes "target" property
"2": "value" // Third child becomes "value" property
}
}
}
}

AST transformations use JSON Expression syntax for powerful, declarative tree transformations. This system is completely portable since it's all JSON - no language-specific transformations or code injections are needed.
The transformation happens as follows:
- Default AST Creation: First, a default (canonical) AST node is created from the CST
- JSON Expression Application: Then a JSON Expression is applied to that node, allowing modification or extraction of specific parts
- Result Generation: The resulting JSON from the JSON Expression evaluation becomes the final AST node
- Bottom-up Processing: The process happens bottom-up, with resulting AST nodes supplied to parent node JSON Expression transformations as part of their children array
- Children Array: The children array is the canonical way to specify all children in both CST and AST
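The steps above can be sketched with a deliberately small evaluator. This is a simplified stand-in for JSON Expression that covers only a handful of the operators shown in this document ($, num, substr, ==, ?); the names `evaluate`, `getPath` and `toAst` are hypothetical, and rules are looked up by node type for brevity.

```typescript
// Hypothetical, simplified evaluator; it is not the JSON Expression implementation.
type Expr = unknown;

interface Cst {
  type: string;
  pos: number;
  end: number;
  raw: string;
  children?: Cst[];
}

// Resolve a slash-separated path such as "/children/1" ("" returns the value itself).
function getPath(doc: unknown, path: string): unknown {
  if (path === "") return doc;
  let cur: unknown = doc;
  for (const key of path.slice(1).split("/")) {
    if (cur === null || typeof cur !== "object") return undefined;
    cur = (cur as Record<string, unknown>)[key];
  }
  return cur;
}

function evaluate(expr: Expr, node: unknown): unknown {
  if (!Array.isArray(expr)) return expr; // plain literal
  if (expr.length === 1 && Array.isArray(expr[0])) return expr[0]; // [[...]] is a literal array
  const [op, ...args] = expr;
  const arg = (i: number): unknown => evaluate(args[i], node);
  switch (op) {
    case "$": return getPath(node, String(args[0]));
    case "num": return Number(arg(0));
    case "substr": return String(arg(0)).slice(Number(arg(1)), Number(arg(2)));
    case "==": return arg(0) === arg(1);
    case "?": return arg(0) ? arg(1) : arg(2);
    default: throw new Error("Unsupported operator: " + String(op));
  }
}

function toAst(cst: Cst, rules: Record<string, Expr>): unknown {
  // Children are transformed first, so parents see finished AST nodes.
  const children = (cst.children ?? []).map((c) => toAst(c, rules));
  // Default (canonical) node.
  const canonical = { type: cst.type, pos: cst.pos, end: cst.end, raw: cst.raw, children };
  // The result of the rule's expression, if any, becomes the AST node.
  const hasRule = Object.prototype.hasOwnProperty.call(rules, cst.type);
  return hasRule ? evaluate(rules[cst.type], canonical) : canonical;
}
```

With the rules `{"Number": ["num", ["$", "/raw"]], "Array": ["$", "/children/1"]}`, a three-child Array node whose middle child is a Number with raw "42" evaluates to the number 42. Dropping nodes whose expression is null from the parent's children is left out of this sketch.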
Skip Node:
{"ast": null}

Use Child Node:
{"ast": ["$", "/children/0"]}

Custom Object:
{
"ast": {
"type": "CustomType",
"value": ["$", "/raw"]
}
}

Value Extraction:
["$", "/raw"] // Get raw matched text
["$", "/children/0"] // Get first child's AST
["$", "/pos"] // Get start position
["$", "/end"] // Get end position

Type Conversion:
["num", ["$", "/raw"]] // Convert to number
["bool", ["$", "/raw"]] // Convert to boolean
// Example transformation:
// Input CST: {type: "Number", raw: "42", pos: 0, end: 2}
// Transform: ["num", ["$", "/raw"]]
// Output AST: 42

String Operations:
["substr", ["$", "/raw"], 1, -1] // Remove first and last character
["len", ["$", "/raw"]] // Get string length
// Example transformation:
// Input CST: {type: "String", raw: "\"hello\"", pos: 0, end: 7}
// Transform: ["substr", ["$", "/raw"], 1, -1]
// Output AST: "hello"

Array Operations:
["push", [[]], ["$", "/children/0"]] // Create array with element
["concat", ["$", "/children/0"], ["$", "/children/1"]] // Concatenate arrays

Conditional Logic:
["?", ["==", ["$", "/raw"], "true"], true, false] // Ternary operator

Object Construction:
["o.set", ["$", ""], "key", ["$", "/children/0"]] // Set object property

Use the children property to map CST child indices to AST properties:
{
"p": [{"r": "Key"}, ":", {"r": "Value"}],
"children": {
"0": "key",
"2": "value"
}
}

This creates an AST node with key and value properties instead of a children array.
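The effect of a children mapping can be sketched as a small function. It is an illustration only: `applyChildrenMap` is a hypothetical name, and keeping type, pos and end on the result is an assumption that the source does not spell out.

```typescript
interface CanonicalNode {
  type: string;
  pos: number;
  end: number;
  children?: unknown[];
}

// Hypothetical helper: move the mapped children onto named properties.
function applyChildrenMap(
  node: CanonicalNode,
  mapping: Record<number, string>,
): Record<string, unknown> {
  // Assumption: the positional fields survive; only the children array is replaced.
  const out: Record<string, unknown> = { type: node.type, pos: node.pos, end: node.end };
  const children = node.children ?? [];
  for (const index of Object.keys(mapping)) {
    const name = mapping[Number(index)];
    if (name !== undefined) out[name] = children[Number(index)];
  }
  return out;
}
```

Children whose index is not listed in the mapping, such as the ":" separator at index 1 above, are simply left out of the result.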
{
"cst": {
"Boolean": {"t": ["true", "false"]}
},
"ast": {
"Boolean": ["==", ["$", "/raw"], "true"]
}
}

{
"cst": {
"Number": "/\\d+/"
},
"ast": {
"Number": ["num", ["$", "/raw"]]
}
}

{
"cst": {
"String": "/\"[^\"]*\"/"
},
"ast": {
"String": ["substr", ["$", "/raw"], 1, -1]
}
}

{
"cst": {
"Array": ["[", {"r": "Elements"}, "]"]
},
"ast": {
"Array": ["$", "/children/1"]
}
}

{
"cst": {
"CommaSeparated": {
"p": [
{"r": "Item"},
{
"l": {
"p": [",", {"r": "Item"}],
"ast": ["$", "/children/1"]
}
}
],
"ast": ["concat", ["push", [[]], ["$", "/children/0"]], ["$", "/children/1"]]
}
}
}

A basic calculator grammar that handles addition, subtraction, multiplication, and division with the usual operator precedence:
{
"start": "Expression",
"cst": {
"Expression": {
"p": [{"r": "Term"}, {"l": {"p": [{"r": "AddOp"}, {"r": "Term"}]}}],
"ast": ["foldl", ["$", "/children/0"], ["$", "/children/1"]]
},
"Term": {
"p": [{"r": "Factor"}, {"l": {"p": [{"r": "MulOp"}, {"r": "Factor"}]}}],
"ast": ["foldl", ["$", "/children/0"], ["$", "/children/1"]]
},
"Factor": {
"u": [
{"r": "Number"},
{"p": ["(", {"r": "Expression"}, ")"], "ast": ["$", "/children/1"]}
]
},
"Number": "/\\d+/",
"AddOp": {"u": ["+", "-"]},
"MulOp": {"u": ["*", "/"]}
},
"ast": {
"Number": ["num", ["$", "/raw"]],
"AddOp": ["$", "/raw"],
"MulOp": ["$", "/raw"]
}
}

A complete JSON parser grammar:
{
"start": "Value",
"cst": {
"WOpt": {"t": [" ", "\n", "\t", "\r"], "repeat": "*", "ast": null},
"Value": [{"r": "WOpt"}, {"r": "TValue"}, {"r": "WOpt"}],
"TValue": {
"u": [
{"r": "Null"},
{"r": "Boolean"},
{"r": "String"},
{"r": "Object"},
{"r": "Array"},
{"r": "Number"}
]
},
"Null": "null",
"Boolean": {"t": ["true", "false"]},
"Number": "/\\-?(0|([1-9][0-9]*))(\\.\\d+)?([eE][\\+\\-]?\\d+)?/",
"String": "/\"[^\"\\\\]*(?:\\\\.|[^\"\\\\]*)*\"/",
"Array": ["[", {"r": "Elements"}, "]"],
"Elements": {
"u": [
{
"p": [
{"r": "Value"},
{
"l": {
"p": [",", {"r": "Value"}],
"ast": ["$", "/children/1"]
}
}
],
"ast": ["concat", ["push", [[]], ["$", "/children/0"]], ["$", "/children/1"]]
},
{"r": "WOpt"}
]
},
"Object": ["{", {"r": "Members"}, "}"],
"Members": {
"u": [
{
"p": [
{"r": "Entry"},
{
"l": {
"p": [",", {"r": "Entry"}],
"ast": ["$", "/children/1"]
}
}
],
"ast": ["concat", ["push", [[]], ["$", "/children/0"]], ["$", "/children/1"]]
},
{"r": "WOpt"}
]
},
"Entry": {
"p": [{"r": "WOpt"}, {"r": "String"}, {"r": "WOpt"}, ":", {"r": "Value"}],
"children": {
"1": "key",
"4": "value"
}
}
},
"ast": {
"Value": ["$", "/children/1"],
"Boolean": ["==", ["$", "/raw"], "true"],
"Number": ["num", ["$", "/raw"]],
"String": ["substr", ["$", "/raw"], 1, -1],
"Array": ["$", "/children/1"],
"Object": ["$", "/children/1"],
"Elements": ["?", ["len", ["$", "/children"]], ["$", "/children/0"], [[]]],
"Members": ["?", ["len", ["$", "/children"]], ["$", "/children/0"], [[]]]
}
}

- Start Simple: Begin with basic rules and gradually add complexity
- Use Meaningful Names: Choose descriptive rule names that reflect their purpose
- Leverage Shortcuts: Use shorthand syntax where appropriate for cleaner grammars
- Whitespace Handling: Create dedicated whitespace rules with ast: null for clean ASTs
- Left Recursion: Avoid left-recursive rules; express repetition with list nodes or right recursion instead
- Order Alternatives: Union nodes select the first alternative that matches, so place longer or more specific alternatives before ones that match a prefix of them; among the rest, place the most common alternatives first
- Minimize Backtracking: Design grammars to reduce ambiguity
- Atomic Groups: Use terminal nodes for performance-critical patterns
- Sample Data: Provide sample strings for testing and validation
- Semantic Focus: Include only semantically meaningful information in ASTs
- Consistent Structure: Maintain consistent AST node shapes across similar constructs
- Type Safety: Use clear, descriptive type names for AST nodes
- Flatten Lists: Transform complex nested structures into simpler forms
- Debug Mode: Use debug compilation for grammar development
- Incremental Testing: Test grammar rules individually before combining
- Trace Analysis: Leverage debug traces to understand parse failures
- Sample Validation: Verify grammars against known good and bad inputs
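The advice on ordering alternatives follows from ordered choice: once an alternative matches, later ones are never tried. A minimal sketch with plain string literals shows the effect; `firstMatch` is a hypothetical helper, not a library function.

```typescript
// Hypothetical helper: try each literal in order and return the first that matches at `pos`.
function firstMatch(alternatives: string[], input: string, pos: number): string | undefined {
  for (const alt of alternatives) {
    if (input.slice(pos, pos + alt.length) === alt) return alt;
  }
  return undefined;
}

// "<" shadows "<=" when it is listed first.
firstMatch(["<", "<="], "a <= b", 2); // returns "<"
// Listing the longer alternative first gives the intended result.
firstMatch(["<=", "<"], "a <= b", 2); // returns "<="
```

The same reasoning applies to union nodes whose alternatives are rules rather than literals: an earlier rule that accepts a prefix of the input wins even if a later rule would have consumed more.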
Debug traces can be captured during grammar development to understand parsing behavior. Debug trace nodes typically form a tree structure that mirrors the grammar execution, allowing developers to:
- Trace Execution: Follow the parser's decision-making process through the grammar
- Identify Failures: Pinpoint where parsing fails and why certain rules don't match
- Performance Analysis: Understand which rules are expensive or cause excessive backtracking
- Grammar Validation: Verify that the grammar behaves as expected on test inputs
Debug trace nodes generally contain information about:
- Rule name being executed
- Input position and matched text
- Success or failure status
- Child trace nodes for nested rule calls
- Backtracking information
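As an illustration of the fields listed above, a trace node and a simple printer might look like the following. The shape is hypothetical: actual implementations may use different field names, and backtracking details are omitted here.

```typescript
// Hypothetical shape; real implementations may name or structure these fields differently.
interface TraceNode {
  rule: string;          // rule name being executed
  pos: number;           // input position where the attempt started
  matched?: string;      // matched text, present on success
  ok: boolean;           // success or failure status
  children: TraceNode[]; // nested rule calls
}

// Render a trace as an indented outline, one line per rule attempt.
function formatTrace(node: TraceNode, depth = 0): string {
  const indent = new Array(depth + 1).join("  ");
  const status = node.ok ? "ok " + JSON.stringify(node.matched ?? "") : "FAIL";
  const line = indent + node.rule + " @" + node.pos + " " + status;
  return [line].concat(node.children.map((c) => formatTrace(c, depth + 1))).join("\n");
}
```

Printing the outline for a failing input makes it easy to see which rule was attempted last and at which position it gave up.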
JSON Grammar implementations may provide utilities to:
- Print Grammar Structure: Display the grammar in a human-readable format
- Visualize Parse Trees: Show the concrete syntax tree structure
- Export Grammar Metadata: Generate documentation or schema information
- Validate Grammar Rules: Check for common issues like left recursion or unreachable rules
- Documentation: Keep comments and examples alongside grammar files (standard JSON has no comment syntax)
- Modularity: Break complex grammars into logical sections
- Version Control: Track grammar evolution through version control
- Testing: Maintain comprehensive test suites for grammar rules
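One of the rule checks mentioned earlier, finding unreachable rules, is easy to automate and fits well into a grammar test suite. The sketch below walks references from the start rule; `refsOf` and `unreachableRules` are hypothetical names, and the node type is a simplified version of the shapes described in this document.

```typescript
// Simplified node type for illustration; optional fields such as "type" and "ast" are omitted.
type GrammarNode =
  | string
  | GrammarNode[]
  | { r: string }
  | { t: string | string[] }
  | { p: GrammarNode[] }
  | { u: GrammarNode[] }
  | { l: GrammarNode };

// Collect the rule names referenced anywhere inside a node.
function refsOf(node: GrammarNode): string[] {
  if (typeof node === "string") return [];
  if (Array.isArray(node)) {
    return node.reduce<string[]>((acc, n) => acc.concat(refsOf(n)), []);
  }
  if ("r" in node) return [node.r];
  if ("p" in node) return refsOf(node.p);
  if ("u" in node) return refsOf(node.u);
  if ("l" in node) return refsOf(node.l);
  return []; // terminal
}

// Return the names of rules that cannot be reached from the start rule.
function unreachableRules(start: string, cst: Record<string, GrammarNode>): string[] {
  const seen: Record<string, boolean> = {};
  const stack: string[] = [start];
  while (stack.length > 0) {
    const name = stack.pop() as string;
    const node = Object.prototype.hasOwnProperty.call(cst, name) ? cst[name] : undefined;
    if (node === undefined || seen[name] === true) continue;
    seen[name] = true;
    for (const ref of refsOf(node)) stack.push(ref);
  }
  return Object.keys(cst).filter((name) => seen[name] !== true);
}
```

Detecting left recursion needs more analysis (for example, which nodes can match the empty string) and is not covered by this sketch.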
Optional Element:
{"u": [{"r": "Element"}, ""]} // Matches: Element OR nothing

Separated List:
{
"u": [
{
"p": [ // Matches: item1, item2, item3
{"r": "Item"},
{"l": {"p": [",", {"r": "Item"}], "ast": ["$", "/children/1"]}}
],
"ast": ["concat", ["push", [[]], ["$", "/children/0"]], ["$", "/children/1"]]
},
"" // OR empty list
]
}

Keywords and Identifiers:
{
"cst": {
"Identifier": {
"u": [ // Tries Keyword first, then falls back to a general identifier
{"r": "Keyword"},
"/[a-zA-Z_][a-zA-Z0-9_]*/"
]
},
"Keyword": {
"u": ["if", "else", "while", "for", "return"] // Matches: reserved keywords
}
}
}