feat(express): Introduce A2UI Express compiler, draft proposals, and evaluations #1678
gspencergoog wants to merge 10 commits into
Conversation
7d7fdeb to
e242e09
f696fd7 to
cc1114d
refactor(express): implement thread-safety, type safety, and operations constants
- Refactors `ExpressCompiler` to use a stateless, thread-safe `_CompileContext` instance for compiler invocations, preventing race conditions during concurrent execution.
- Introduces `SurfaceOperation` constants for standard A2UI surface message envelope keys, removing hardcoded string lookups across the compiler and decompiler.
- Adds explicit, strict type annotations to `_load_mappings` and `get_property_enum` in `CatalogSchemaHelper`.
- Adds `test_compiler_concurrency` regression test to verify concurrent compile runs on a single compiler instance, with overall package test coverage maintained above 90%.

docs(express): refocus README on Gemini API execution
- Restructures `specification/proposals/express/README.md` to highlight executing inference and validation using remote Gemini models (e.g. `gemini-3.1-flash-lite`).
- Relocates local Apple Silicon MLX setup instructions to the bottom of the document.

docs(express): update README with env requirements and remove outdated documents
- Updates `specification/proposals/express/README.md` to document the mandatory `A2UI_EXPRESS_ENABLED` environment variable gate.
- Standardizes command usage in README to use `uv run` with the environment variable.
- Resolves file extension in compiler example from `.express` to `.a2ui`.
- Removes outdated `basic_prompt.md` and `evolve_express.md` files from the proposals directory.

fix(express): correct compiler check mapping, multiline string parsing, and decompiler formatting
- Fixes positional check parsing shifts so checks do not map to preceding optional properties (like weight).
- Corrects multiline string compiler logic to preserve blank lines inside active statements.
- Improves check argument matching to map string literals to custom validation messages when property schemas expect non-string types.
- Optimizes the decompiler to strip unnecessary "_" placeholders.
- Adds comprehensive round-trip tests to verify all 36 basic catalog examples against their Express DSL counterparts, achieving 91% total coverage.
- Fixes missing Optional typing imports in schema_helper.py and prompt_generator.py.
- Fixes fix_format.sh to make corepack enable non-fatal, preventing permission errors from aborting formatting early on developer machines.

feat(express): introduce A2UI Express compilation pipeline, specification proposals, and evaluations
- Introduce A2UI Express, an experimental, compact DSL notation allowing agents to describe UI layouts using minimal tokens.
- Add `agent_sdks/python/a2ui_agent/src/a2ui/express/` package implementing the `ExpressCompiler` (translating flat DSL to standard wire JSON) and `ExpressDecompiler` (reconstructing DSL from wire JSON).
- Secure Express package behind a disabled-by-default environment flag `A2UI_EXPRESS_ENABLED`.
- Relocate and formalize the specification draft to `/specification/proposals/express/` to avoid polluting the v1.0 baseline.
- Add the `express` layout generation strategy inside evaluations (`eval/`), featuring support for running it optionally via a comma-separated or repeating `--strategies` CLI list parameter.
- Exclude research-facing genetic prompt optimizer tooling from this base branch (moved to `a2ui_express_optimizer` branch).
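The disabled-by-default gate mentioned in these commits could look roughly like the sketch below. Only the variable name `A2UI_EXPRESS_ENABLED` comes from the commit messages; the helper names `express_enabled` and `require_express` and the accepted values (`true`, `1`) are assumptions for illustration, not the package's actual API.

```python
import os
from typing import Mapping


def express_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """True only when the opt-in flag is explicitly set (accepted values assumed)."""
    value = environ.get("A2UI_EXPRESS_ENABLED", "")
    return value.strip().lower() in {"1", "true"}


def require_express(environ: Mapping[str, str] = os.environ) -> None:
    """Hypothetical guard to call before using any experimental Express code."""
    if not express_enabled(environ):
        raise RuntimeError(
            "A2UI Express is experimental; set A2UI_EXPRESS_ENABLED=true to use it."
        )
```

Passing the environment in as a parameter keeps the check easy to test without mutating `os.environ`.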
dfcec4a to
71b9534
There was a problem hiding this comment.
Code Review
This pull request introduces A2UI Express, a compact, token-efficient declarative DSL for generative user interfaces, complete with a compiler, decompiler, prompt generator, CLI tools, and integration into the evaluation framework. The code review highlights several critical issues, including a sentinel parsing bug that ignores statements on the same line as tags, a potential crash in the schema helper when handling boolean schemas in allOf, and a decompiler bug that incorrectly strips quotes from string literals matching component IDs. Additionally, the reviewer noted hardcoded absolute paths in the dataset translation script and a PEP 8 import style violation in the evaluation solver.
- Fixes compiler sentinel tag parsing to strip tags and process statement content remaining on the same line.
- Adds boolean schema type guards in `CatalogSchemaHelper._load_mappings` when resolving `allOf` elements.
- Implements `is_ref` parameters in `ExpressDecompiler._decompile_value` to prevent string literal values matching component IDs from being decompiled without quotes.
- Corrects path resolution in `translate_dataset.py` to be relative to the script directory.
- Re-orders imports in `express.py` strategy file to follow PEP 8.
- Adds comprehensive regression tests in `test_express.py` covering all fixes, maintaining 100% pass rate.
| @solver | ||
| def a2ui_express_prompt(catalog_path: str) -> Solver: | ||
| """Solver to inject A2UI Express prompt contract instructions.""" | ||
| generator = ExpressPromptGenerator(catalog_path) |
There was a problem hiding this comment.
Next, let's combine ExpressPromptGenerator and ExpressCompiler into one ExpressInferenceFormat which implements AbstractInferenceFormat, so that people can implement other inference formats and reuse this inspect_ai strategy.
There was a problem hiding this comment.
Good idea. I'll leave that for another PR though.
| return results | ||
| class ExpressDecompiler: |
There was a problem hiding this comment.
Yes! Love the idea of having the reverse direction too, so we can make sure that the model also sees content through the DSL.
There was a problem hiding this comment.
This was partly to make sure that I could "round trip" things to make sure that the express format could support everything that A2UI can. It was also helpful in converting examples in the catalog to use in the prompt.
There was a problem hiding this comment.
I think this is also important to handle conversation history, assuming it is stored in the more standardized, stable A2UI format. If the agent is communicating back and forth with A2UI, we want it to always read and write A2UI in the same format, to avoid confusion (and maximise efficiency!). So I think we need the decompiler for this.
Steps for an inference in a multi-turn conversation:
- Convert all A2UI messages in existing conversation history to Express format (requires decompiler)
- Convert Catalog to express format
- Prompt agent with system prompt in express format, and conversation history also using express format.
- Parse express output back to A2UI (requires compiler)
- Persist conversation history in A2UI format.
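The steps above can be sketched as a single turn function. This is a minimal illustration, not the SDK's API: `run_turn`, `call_model`, `decompile`, and `compile_express` are hypothetical names, and the compiler and decompiler are passed in as plain callables so the sketch does not assume their real signatures.

```python
from typing import Callable, List

# Hypothetical stand-ins; the real compiler/decompiler interfaces may differ.
Decompile = Callable[[dict], str]  # A2UI JSON message -> Express text
Compile = Callable[[str], dict]    # Express text -> A2UI JSON message


def run_turn(
    history: List[dict],
    catalog_prompt: str,
    user_message: str,
    call_model: Callable[[str], str],
    decompile: Decompile,
    compile_express: Compile,
) -> List[dict]:
    """Runs one inference turn and returns the updated A2UI-format history."""
    # Step 1: convert stored A2UI messages to Express so the model reads one format.
    express_history = [decompile(message) for message in history]
    # Steps 2-3: prompt with the Express catalog description plus Express history.
    prompt = "\n\n".join([catalog_prompt, *express_history, user_message])
    express_output = call_model(prompt)
    # Step 4: parse the Express output back into standard A2UI.
    a2ui_message = compile_express(express_output)
    # Step 5: persist history in A2UI format only; the input list is not mutated.
    return [*history, a2ui_message]
```

Keeping the stored history in A2UI means the Express format can change between versions without invalidating persisted conversations.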
There was a problem hiding this comment.
Do these examples need to be checked in? Could we generate them on demand using the decompiler from the canonical A2UI examples?
There was a problem hiding this comment.
We could generate them, yes. That's how I produced them.
There was a problem hiding this comment.
Amazing! Should this be checked in though?
There was a problem hiding this comment.
No, that is an artifact of the improvement loop I was running. I'll remove it.
Are you intending to land this PR as-is? Any chance of breaking it up a bit, to make it easier to do thorough reviewing?
| The design of A2UI Express focuses on four main requirements: | ||
| - Token footprint reduction. Generative models spend excessive output tokens when producing verbose JSON structures. A2UI Express removes structural keys, brackets, and repeated quotes, reducing output tokens by 55% to 70% compared to native A2UI wire payloads. | ||
| - On-device model optimization. Small local models, such as Gemma 4 E2B and E4B, operate with limited context windows and reasoning budgets. The syntax uses clean positional signatures that fit into prompt contracts without consuming excessive context. |
There was a problem hiding this comment.
What does "clean positional signatures" refer to?
There was a problem hiding this comment.
It just means that the DSL uses argument positions without named arguments. "Clean" is because they don't have the named arguments, I guess.
| The syntax supports three literal primitive types: | ||
| - Strings are enclosed in straight double quotes, for example `"Enter your name"`. |
There was a problem hiding this comment.
Embedded newlines are allowed, they just have to be properly closed.
I added more rules about quoting and escaping too:
- Strings are represented in two formats:
  - Standard Strings: Enclosed in single double quotes (e.g., `"Enter your name"`) or triple double quotes (e.g., `"""Line 1\nLine 2"""`). Standard strings support common escape sequences: `\n` (newline), `\t` (tab), `\\` (backslash), and `\"` (double quote). Embedded newlines are allowed.
  - Raw Strings: Prefaced by `r` (e.g., `r"^[a-zA-Z]+$"` or `r"""Raw multi-line content"""`). In raw strings, no escape sequences are processed, and backslashes are interpreted as literal characters. This is particularly useful for validation regex patterns containing backslashes. Embedded newlines are allowed.
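A minimal sketch of how these string rules might be applied to one already-tokenized literal. The function names are hypothetical, and the choice to leave unknown escapes untouched follows the later commit note about "treating all other escape sequences literally"; the real lexer may handle edge cases differently.

```python
import re

# Only these four escapes are resolved in standard strings; anything else
# (for example "\d") is kept as written, backslash included.
_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}


def unescape_standard(body: str) -> str:
    """Resolves the supported escape sequences in a standard string body."""
    return re.sub(
        r"\\(.)",
        lambda m: _ESCAPES.get(m.group(1), m.group(0)),
        body,
        flags=re.DOTALL,
    )


def parse_string_literal(token: str) -> str:
    """Returns the value of one complete Express string literal token."""
    raw = token.startswith("r")
    if raw:
        token = token[1:]
    quote = '"""' if token.startswith('"""') else '"'
    if len(token) < 2 * len(quote) or not token.endswith(quote):
        raise ValueError("unterminated string literal")
    body = token[len(quote):-len(quote)]
    # Raw strings are returned verbatim; standard strings are unescaped.
    return body if raw else unescape_standard(body)
```

For example, `parse_string_literal('r"^\\d+$"')` keeps the backslash, while the same text without the `r` prefix would also keep it only because `\d` is not one of the four supported escapes.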
| ## Compilation pipeline | ||
| The compilation pipeline runs on the host application. It takes the plain text stream of A2UI Express, processes it, and emits a standard A2UI v1.0 JSON payload. |
There was a problem hiding this comment.
The compilation pipeline runs on the host application.
Do we need to opine on this? I thought the compiler runs on the agent (and is an implementation detail of it). If it runs on the client, we have more problems, like dealing with versioning.
There was a problem hiding this comment.
It does. I think "host application" is the agent here. I'll reword it.
Some more general comments:
No, I mainly wanted to get your (and Jia Hao's) impressions on it. I'll split it up today into at least two PRs, one with the proposal directory and docs, and one with the Python agent implementation and evals, taking into account both of your review comments. I was maybe a little overexcited to share the results and it needs some more pruning and cleaning.
Yes, I agree. I did run the direct model on Lite too, so it's at least comparable, but you're right that there isn't a lot more reasoning overhead left there after building the UIs. I mainly wanted to see 1) that a Lite model could handle the reasoning needed (showing that it wasn't too reasoning heavy), and 2) to see how fast we could go if latency were the only driver. It works well in Flash 3.5, but is of course slower. I found that limiting the thinking budget helped a bit there.
Yes, that is my intention. The rendering clients should have no idea that Express was ever involved, and just see regular A2UI.
| username = Text($/username, "caption") | ||
| bio = Text($/bio, "body") | ||
| stats-row = Row([followers-col, following-col, posts-col], "spaceAround") | ||
| followers-col = Column([followers-count, followers-label], _, "center") |
There was a problem hiding this comment.
Are we sure we want to support names that contain "-" and other special characters? In the future, when models become more advanced, we might want to introduce features that would take advantage of special characters, but allowing them in identifier names could cause ambiguities in the grammar.
There was a problem hiding this comment.
Good point. We probably should limit it to the recommendations in https://www.unicode.org/reports/tr31/ so that we can allow Unicode characters in identifiers (I recently added this to A2UI as well).
| root = Card(main-column) | ||
| main-column = Column([title-text, markdown-content], _, "stretch") | ||
| title-text = Text("### Markdown Rendering") | ||
| markdown-content = Text("# Heading 1 |
There was a problem hiding this comment.
Another " can almost certainly appear inside a multiline string like this. How do we ensure we can still parse it correctly?
There was a problem hiding this comment.
"Naaah, that'll
"never" happen." :-)
Sure, we should specify quoting rules. I really like the """ rules that other languages like Dart have. I want to make sure we can keep the prompt small though, so it's easy to describe, but I don't want to open the door to something like just saying "Use Python string quoting rules" because that opens the door to supporting lots of other syntax (raw string quoting, multiple character escape formats, misinterpreting conflicting string interpolation formats, etc.). Maybe there's a way we can say it that is concise but doesn't mention a language.
e.g. "For quoting strings, surround with a double quote. If a double quote appears in the string, then use """ around the string. Individual double quotes may also be escaped with a preceding backslash. For strings that are raw strings interpreted literally, precede the string with an r: r" or r""" ".
Unlike other languages, one of the driving criteria here is that its rules are able to be concisely described in English prose.
| $/now = "2025-12-15T12:00:00Z" | ||
| root = Card(main-column) | ||
| main-column = Column([welcome-text, email-field, phone-field, zip-field, terms-checkbox, submit-btn], _, "stretch") | ||
| welcome-text = Text(formatString("Hello! Today is ${formatDate(value: ${/now}, format: 'EEEE, MMMM d')}.")) |
There was a problem hiding this comment.
Do we need formatString? Dart and JS can format strings without it. Just the ${ syntax is enough. It's also fewer tokens this way.
There was a problem hiding this comment.
formatString is implemented in the catalog. This makes it so that a catalog author can implement whatever string formatting they want to use (e.g. you could implement a printf function and use printf formatting instead). It also makes it a lot easier to implement a renderer, since the string formatting isn't part of the specification, and implementing formatString is actually pretty involved (it's reactive, and so generates new strings whenever the interpolated values, including interpolated function calls, change, for instance).
We provide an implementation in the core libraries that catalog writers can leverage so they can just include it.
I can see the argument for including it in the language, but if we do that for Express, it needs to be included in A2UI as well, and implemented in all renderers.
There was a problem hiding this comment.
I agree with just using our existing formatString function call approach for the first cut of express, to keep it catalog-agnostic.
But in the future, I think it'd be interesting to pursue these optimizations which make the format catalog-specific in service of greater performance gains.
We might be able to generalize this - JSON render has a concept of "directives" which is something like this - https://json-render.dev/docs/directives.
So perhaps you create a generic ExpressInferenceFormat and then install a directive that transforms ${/number} to formatString(/number) etc
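As a rough sketch of what such a directive might do, the function below rewrites interpolated string literals in one Express statement into explicit `formatString(...)` calls. The name `apply_format_directive` is hypothetical, the rewrite is purely textual, and it only understands ordinary double-quoted strings (not triple-quoted or raw ones), so it illustrates the idea rather than a working design.

```python
import re

# Matches an ordinary double-quoted literal, allowing backslash escapes.
_STRING = re.compile(r'"(?:[^"\\]|\\.)*"')


def apply_format_directive(statement: str) -> str:
    """Wraps interpolated string literals in an explicit formatString call."""

    def wrap(match: re.Match) -> str:
        literal = match.group(0)
        # Skip literals that are already the argument of formatString(...).
        already_wrapped = statement[: match.start()].endswith("formatString(")
        if "${" not in literal or already_wrapped:
            return literal
        return f"formatString({literal})"

    return _STRING.sub(wrap, statement)
```

With this in place the model could write `Text("Hello, ${/user/name}!")` and the compiler input would still contain the catalog's `formatString` call, at the cost of making the format aware of one catalog function.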
munificent
left a comment
There was a problem hiding this comment.
@yjbanov asked me to take a look with an eye to language design stuff. I have very little context on the overall problem being solved and I'm definitely not an expert at what kinds of code LLMs are good at writing and reading, so take all of this with a very large grain of salt.
Most of my comments are probably fine but I figured it's better to bring it up than potentially miss a problem.
Overall, this makes sense to me. Unlike a syntax where the entire thing is one monolithic tree which isn't meaningfully parsable until you have the whole thing, it gives you a way to break the code into smaller separately handle-able units.
That does raise questions around forward declarations and how names are resolved and managed. Name resolution in general is pretty vague here and is something you'll likely want to be pedantic about. It's a part of language design that has a lot of sharp edges.
Also, an explicit grammar in something like EBNF would be nice to see. I know it's not something that everyone loves, but the exercise of writing it will force you to answer a lot of things that might otherwise be left implicit and then become subtle parser bugs. (For example, it's not clear from the spec here if component definitions can have map literals as arguments or not. Can lists have trailing commas? Be empty? Be empty except for just a comma?)
I'm always excited to see people doing novel language design to approach a problem a different way! :D
| The design of A2UI Express focuses on four main requirements: | ||
| - Token footprint reduction. Generative models spend excessive output tokens when producing verbose JSON structures. A2UI Express removes structural keys, brackets, and repeated quotes, reducing output tokens by 55% to 70% compared to native A2UI wire payloads. |
There was a problem hiding this comment.
It's surprising that this reduces token size so much. In theory, a directly nested syntax should be more concise than what's proposed here because it avoids repeating path components, as in:
foo/bar/baz/bang/a = 1
foo/bar/baz/bang/b = 2
foo/bar/baz/bang/c = 3
foo/bar/baz/bang/d = 4
foo/bar/baz/bang/e = 5
// 55 tokens ("token" in the PL sense, not LLM sense)
// Versus:
foo(
bar(
baz(
bang(
a = 1
b = 2
c = 3
d = 4
e = 5
)
)
)
)
// 27 tokens
So is the improvement here really from the flattening, or from not having a nesting syntax that has a lot of other unnecessary boilerplate like quoted key names, argument names, comma separators, etc.?
If so, perhaps it would be beneficial for this notation to allow nesting too?
There was a problem hiding this comment.
There were two reasons I stayed away from this kind of nesting:
- If we want to "stream" user interfaces, nesting like this is hard, since we have to wait until the end, or do some auto-closing magic in order to do intermediate states. If I can use adjacency lists to separate the components from the list, then it makes streaming much nicer: we ignore symbols that we don't recognize yet, and fill them in when they are defined. This lets the LLM be sloppy about ordering the definitions.
- The LLM then doesn't need to keep track of all the nesting paren levels, which it is not terribly good at.
| - Token footprint reduction. Generative models spend excessive output tokens when producing verbose JSON structures. A2UI Express removes structural keys, brackets, and repeated quotes, reducing output tokens by 55% to 70% compared to native A2UI wire payloads. | ||
| - On-device model optimization. Small local models, such as Gemma 4 E2B and E4B, operate with limited context windows and reasoning budgets. The syntax uses clean positional signatures that fit into prompt contracts without consuming excessive context. | ||
| - Streaming compatibility. The line-oriented grammar allows the client host to parse and build the component hierarchy line-by-line, enabling progressive rendering of the interface before the model finishes its output. |
There was a problem hiding this comment.
One of the examples below looks like:
root = Card(main-column)
main-column = Column([icon, title, description, actions], _, "center")
icon = Icon($/icon)
title = Text($/title, "h3")
...
Lines here often refer to names declared on later lines. That implies that we can't always process lines as they come in, unless the system can gracefully handle references to unknown entities.
There was a problem hiding this comment.
Yes, this is on purpose to let the LLM control streaming behavior. The A2UI renderers already handle this by ignoring symbols that they don't recognize until they are defined, and also caching symbol definitions that aren't yet connected to anything until they get used. This lets us stream in a Column with identifiers for the children, and then fill in the children as they come in, or vice versa and have them all pop in at once when the Column definition arrives.
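A small sketch of that behavior, assuming one statement per line and treating any bare identifier in the arguments as a child reference. `StreamingTree` and its methods are illustrative names, not the renderer's or compiler's real API, and the sketch assumes the definitions contain no reference cycles.

```python
import re
from typing import Dict, List, Optional

_ASSIGN = re.compile(r"^([^\W\d]\w*)\s*=\s*([^\W\d]\w*)\((.*)\)$", re.DOTALL)
_STRING = re.compile(r'"(?:[^"\\]|\\.)*"')
_BINDING = re.compile(r"\$/[\w/]*")
_NAME = re.compile(r"[^\W\d]\w*")


def child_references(args: str) -> List[str]:
    """Bare identifiers left after dropping strings, data bindings, and `_`."""
    stripped = _BINDING.sub("", _STRING.sub("", args))
    return [name for name in _NAME.findall(stripped) if name != "_"]


class StreamingTree:
    """Caches definitions as they arrive; undefined children are skipped."""

    def __init__(self) -> None:
        self.components: Dict[str, dict] = {}

    def feed(self, line: str) -> None:
        match = _ASSIGN.match(line.strip())
        if match is None:
            return  # Not a component definition; ignored in this sketch.
        name, kind, args = match.groups()
        self.components[name] = {"type": kind, "children": child_references(args)}

    def snapshot(self, name: str = "root") -> Optional[dict]:
        """Builds the tree that is renderable right now (assumes no cycles)."""
        node = self.components.get(name)
        if node is None:
            return None  # Not defined yet: the parent renders without it.
        children = [self.snapshot(child) for child in node["children"]]
        return {
            "id": name,
            "type": node["type"],
            "children": [child for child in children if child is not None],
        }
```

Feeding `root = Card(main_column)` first yields a `Card` with no children; a definition of `title` that arrives before `main_column` stays cached and appears only once `main_column` connects it to the tree.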
| ### Variable declarations | ||
| Every component definition is assigned to a unique, alphanumeric variable. The compiler uses these variables to resolve parent-child hierarchies. A reserved variable named `root` acts as the primary entry point for the interface tree. |
There was a problem hiding this comment.
What happens if a user refers to root? What does this do:
root = Row([root])
Or is it "write-only" in some way?
Related: how are cyclic references handled?
There was a problem hiding this comment.
It is an error to have circular references.
The renderers will throw an error back to the agent if they catch circular references. This is a somewhat bad design in that in order to catch these (algorithmically) on the server before they hit the client, the server has to keep track of everything the client has seen. On the other hand, the renderer is the final say as to the actual state of things, so maybe it's appropriate there.
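For illustration, detecting a circular reference over the definitions seen so far is a depth-first search. The sketch below assumes the caller already has a mapping from each component name to the names it references; `find_cycle` is a hypothetical helper, not something the renderers or the SDK are known to expose.

```python
from typing import Dict, List, Optional


def find_cycle(
    children: Dict[str, List[str]], start: str = "root"
) -> Optional[List[str]]:
    """Returns one reference cycle reachable from `start`, or None."""
    path: List[str] = []
    on_path = set()
    done = set()

    def visit(node: str) -> Optional[List[str]]:
        if node in on_path:
            # Found a back edge: report the loop, closed with the repeated name.
            return path[path.index(node):] + [node]
        if node in done or node not in children:
            return None  # Already cleared, or not defined yet.
        on_path.add(node)
        path.append(node)
        for child in children[node]:
            cycle = visit(child)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return visit(start)
```

The `root = Row([root])` case above corresponds to `find_cycle({"root": ["root"]})`, which reports `["root", "root"]`. Running this on the agent side still requires knowing every definition the client has received, which is the bookkeeping cost described above.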
| Every component definition is assigned to a unique, alphanumeric variable. The compiler uses these variables to resolve parent-child hierarchies. A reserved variable named `root` acts as the primary entry point for the interface tree. | ||
| To eliminate syntax errors from complex bracket structures and enable line-oriented streaming compilation, A2UI Express prohibits inline component nesting. Component constructor calls (e.g., `Text(...)`, `Column(...)`) can **only** appear on the right-hand side of a variable assignment (`var = ComponentName(...)`). They **cannot** be passed directly as positional arguments to other components. Instead, you must declare them separately and reference their variable names. |
There was a problem hiding this comment.
For what it's worth, I've seen various little DSLs and hobby languages over the years try to stake a claim like this in the name of simplicity (or because their authors aren't comfortable writing a full expression parser) and most usually end up dialing it back over time. It becomes really annoying if you can't do any computation in a nested expression.
If someone wants to do:
root = Framed(app-frame-thickness + (is-android ? android-frame-adjust : ios-frame-adjust) + 4)
Do you really want them to have to write something like:
a = is-android ? android-frame-adjust : ios-frame-adjust
b = app-frame-thickness + a
c = b + 4
root = Framed(c)
If this DSL is really only for authoring component trees, it's probably fine. But you do have literal values and even lists. Presumably it will be useful to add numbers, concatenate strings, or append to lists. Having to hoist all of that out to separate named declarations could get really annoying.
Though if this code will never be written or read by a human... 🤷
There was a problem hiding this comment.
Well, exactly, it will not be written by a human. Which is weird.
The prohibition is there mainly to keep an LLM from writing an entire tree in one expression, so that streaming works better. It forces the LLM to split the tree into a bunch of lines that can be evaluated as they come in.
| The syntax supports three literal primitive types: | ||
| - Strings are enclosed in straight double quotes, for example `"Enter your name"`. |
| - Client functions are written as `<FunctionName>(<args>)`, matching the exact function names registered in the loaded catalog. | ||
| - If the client catalog contains a text formatting helper (such as `formatString`), it is called explicitly: `welcomeText = Text(formatString("Welcome, ${/user/firstName}!"))`. This prevents failures if a client catalog uses a different naming convention for interpolation. | ||
| - Local actions use this same signature to trigger behaviors, for example `openUrl("https://example.com")`. The compiler maps these to standard client function actions. | ||
| - Server events use a reserved `Event` signature to declare backend actions, for example `Event("save_deal", {rep: $/form/rep})`. |
There was a problem hiding this comment.
So it seems like map literals can be used as expressions basically anywhere? If so, you probably want to add them to ### Core primitive types.
There was a problem hiding this comment.
Yeah, good point, I'll do that.
| ### Validation and logic expressions | ||
| Validation checks are defined using the `?` prefix. If a component expects validation rules, the compiler converts these expressions into standard client-side functions: |
There was a problem hiding this comment.
I don't have enough context to know what "validation" means here. But if the leading ? is just syntactic sugar for calling a function with that name, does it carry its weight?
There was a problem hiding this comment.
Validation here is in the context of "form validation", in the sense of wanting to check that a text field contains an email address, for instance. It is handled by defining a client side function to do the validation and return a boolean. Any function that returns a boolean can be used as a validation function.
The ? is just syntactic sugar for calling a boolean-returning function that takes an implicit first "value" argument.
To be honest, I haven't really thought this part through that well. I think probably it could just be a regular function call syntax, but that does have slightly higher (LLM) token size (not much though).
Maybe it should instead be something like
username = TextInput($/form/username, [required(_), regex(_, "^[0-9]{5}$")])
and we can sub in the value for the _. Right now that would be:
username = TextInput($/form/username, [?required, ?regex("^[0-9]{5}$")])
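To make the relationship between the two spellings concrete, here is a sketch that rewrites the `?` form into the explicit `_` form. It is a textual transform only; `desugar_checks` is a hypothetical name and nothing here reflects how the compiler actually lowers checks to A2UI JSON. String literals are matched first so that a `?` inside a regex pattern is left alone.

```python
import re

# Group 1: a double-quoted string literal (returned unchanged).
# Groups 2-3: a `?name` check, optionally followed by `(` or an empty `()`.
_CHECK = re.compile(r'("(?:[^"\\]|\\.)*")|\?([^\W\d]\w*)(\(\s*\)|\()?')


def desugar_checks(statement: str) -> str:
    """Rewrites `?check` and `?check(args)` into `check(_)` and `check(_, args)`."""

    def rewrite(match: re.Match) -> str:
        literal, name, paren = match.groups()
        if literal is not None:
            return literal
        if paren != "(":
            return f"{name}(_)"  # Bare `?name` or empty `?name()`.
        return f"{name}(_, "     # Remaining arguments follow unchanged.

    return _CHECK.sub(rewrite, statement)
```

The sugar saves a few characters per check, which is the token-size trade-off mentioned above.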
| - Simple checks are written with the function name, for example `?required`. | ||
| - Parameterized checks accept arguments in parentheses, for example `?regex("^[0-9]{5}$", "Must be a valid zip code")`. | ||
| - Multiple checks are grouped in lists: `[?required, ?email]`. |
There was a problem hiding this comment.
So the system implicitly understands that a list containing validation checks is itself a validation check? What about:
[?required, "oops, not a validation check"]
Would it make more sense to do:
?[required, email]
There was a problem hiding this comment.
So the system implicitly understands that a list containing validation checks is itself a validation check? What about:
[?required, "oops, not a validation check"]
This would fail because the string isn't a boolean.
Would it make more sense to do:
?[required, email]
No, the values don't have to be functions, they could also be from the data model.
| ### Line parsing and tokenization | ||
| The compiler reads the input text line-by-line. It discards empty lines and parses assignments into tokens. |
There was a problem hiding this comment.
You do say an assignment can span multiple lines, so it's probably clearer to say that separate top-level assignments or standalone operations may be executed before later ones are parsed. Is that the intent here?
Are there comments?
There was a problem hiding this comment.
Yes, that's the intent.
There aren't explicitly comments, but I do actually ignore both # and // end of line comments in the parser because the LLM sometimes adds them anyhow. We don't want to mention or "allow" them because they just take up tokens we're not going to use.
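A sketch of that tolerance, assuming comments only matter when they fall outside string literals. `strip_line_comment` is an illustrative helper that handles single-line double-quoted strings only; the real lexer presumably does this at the token level and also covers triple-quoted and raw strings.

```python
def strip_line_comment(line: str) -> str:
    """Removes a trailing `#` or `//` comment that sits outside string literals."""
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_string:
            if ch == "\\":
                i += 1  # Skip the escaped character.
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "#" or line.startswith("//", i):
            return line[:i].rstrip()
        i += 1
    return line
```

Tracking string state matters because values such as `"### Markdown Rendering"` or `"https://example.com"` contain both comment markers.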
| If the compiler encounters a syntax error or catalog schema mismatch during parsing, it triggers a structured error recovery workflow: | ||
| 1. Isolation. The compiler flags the invalid line, discards that sub-branch of the AST, and continues parsing the remaining lines to avoid collapsing the user interface. |
There was a problem hiding this comment.
What if the offending line is defining some name that is referred to elsewhere on other lines?
There was a problem hiding this comment.
The other place will just ignore the undefined value. It might not render because of that, but it would just wait until there's a valid value there, which could come in an error-correction update.
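The isolation step can be sketched as a parser that records bad lines instead of raising. This assumes one statement per line and a deliberately simple statement shape; `compile_tolerant` and its return format are hypothetical, not the compiler's actual error-recovery interface.

```python
import re
from typing import Dict, List, Tuple

_STATEMENT = re.compile(r"^([^\W\d]\w*)\s*=\s*([^\W\d]\w*)\((.*)\)$")


def compile_tolerant(
    source: str,
) -> Tuple[Dict[str, Tuple[str, str]], List[Tuple[int, str]]]:
    """Parses what it can; returns (definitions, errors) instead of raising."""
    definitions: Dict[str, Tuple[str, str]] = {}
    errors: List[Tuple[int, str]] = []
    for number, line in enumerate(source.splitlines(), start=1):
        text = line.strip()
        if not text:
            continue
        match = _STATEMENT.match(text)
        if match is None:
            # Flag the line and move on; names it would have defined simply
            # stay undefined until a later correction supplies them.
            errors.append((number, text))
            continue
        name, component, args = match.groups()
        definitions[name] = (component, args)
    return definitions, errors
```

The collected `(line number, text)` pairs are the kind of information an agent would need to issue a targeted correction rather than regenerate the whole interface.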
* Add lexer regexes and token parsing logic for standard triple-quoted strings (using a refined lookahead pattern to support nested quotes) and raw strings (single/triple quoted with zero escape processing).
* Implement strict unescaping for standard strings, resolving only \n, \t, \\, and \", and treating all other escape sequences literally.
* Update prompt generator instructions to include the simplified raw/triple string rules.
* Implement decompiler changes to format string values into the most readable quote style (raw strings for paths/regexes, triple quotes for multi-line or quote-nested strings).
* Update standard catalog example files and evaluation strategy documentation to use the new string formats.
* Add comprehensive test suite covering all string quoting, escaping, and formatting choices.
46b3073 to
6c527e0
…mpts to format-agnostic
- Move a2ui.express package to a2ui.experimental.express.
- Rename prompt texts in v1_0_prompts.yaml to be format-agnostic (removing JSON/createSurface terminology).
- Delete regex prompt-rewriting hacks from eval/a2ui_eval/strategies/express.py.
- Add parse_express_response parser helper in python SDK and use it in express strategy solver.
- Move development run_* helper scripts to specification/proposals/express/scripts/ subfolder.
- Remove temporary leaderboard.json artifact.
- Fix Prettier and Pyink formatting across changed files.
…ocks
- Convert all multi-line and long strings in v1_0_prompts.yaml to use literal block scalar notation (| or |-).
- Remove all manual quote escapes and newline characters from the prompt entries.
- Replace legacy references to createSurface and updateComponents in registration, cart, openUrl, and nestedLayout prompts with format-agnostic descriptions.
- Update TOKEN_SPEC lexer rules in compiler.py to support unclosed strings at end-of-stream and separate horizontal/vertical whitespace.
- Shift statement slicing from a raw line-by-line balancer to a token-by-token statement grouper.
- Add is_final parameter to compile/tokenize to manage streaming chunks versus completed inputs.
- Add specification details on string literal variants in a2ui_express.md.
- Add tests for multi-line unescaped parenthesis in strings, parser syntax checks, and unbalanced trailing structures.
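A simplified sketch of the grouping idea with an `is_final` switch. It walks characters rather than lexer tokens and tracks only double-quote string state and bracket depth, so it is an approximation of the approach named in the commit, not the code in `compiler.py`. `split_statements` and its return shape are assumptions.

```python
from typing import List, Tuple

_OPEN, _CLOSE = "([{", ")]}"


def split_statements(buffer: str, is_final: bool = False) -> Tuple[List[str], str]:
    """Splits buffered text into complete statements plus an unconsumed tail."""
    statements: List[str] = []
    depth = 0
    in_string = False
    start = 0
    i = 0
    while i < len(buffer):
        ch = buffer[i]
        if in_string:
            if ch == "\\":
                i += 1  # Skip the escaped character.
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in _OPEN:
            depth += 1
        elif ch in _CLOSE:
            depth = max(0, depth - 1)
        elif ch == "\n" and depth == 0:
            # A newline outside strings and brackets ends the statement.
            statement = buffer[start:i].strip()
            if statement:
                statements.append(statement)
            start = i + 1
        i += 1
    tail = buffer[start:]
    if is_final and tail.strip():
        # End of stream: hand over whatever is left, even if unbalanced.
        statements.append(tail.strip())
        tail = ""
    return statements, tail
```

While streaming, the caller would keep the returned tail and prepend it to the next chunk; on the last chunk it passes `is_final=True` so an unclosed statement still reaches the error-recovery path.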
40c7583 to
ce0af3c
Variable identifiers now strictly conform to the Unicode Identifier standard (UAX #31), allowing Unicode letters, digits, and underscores, but forbidding dashes (-). This prevents naming ambiguity with future expression/subtraction syntax support.
- Update identifier regex pattern to `[^\W\d]\w*` in `compiler.py`.
- Document identifier rules in `a2ui_express.md` and `prompt_generator.py`.
- Convert all example `.a2ui` variables and `.json` component IDs to use underscores instead of dashes.
- Refactor python tests in `test_express.py`.

BREAKING-CHANGE: Separators like dashes (`-`) are no longer permitted in A2UI Express variable names. Existing DSL definitions containing dashes in variables will fail parsing.
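The pattern quoted in this commit can be exercised directly with Python's `re` module. The wrapper name `is_valid_identifier` is made up for the example; only the pattern itself comes from the commit message.

```python
import re

# Pattern from the commit message: a first character that is a word character
# but not a digit, followed by any number of word characters. For str patterns
# in Python 3, \w and \d are Unicode-aware by default.
IDENTIFIER = re.compile(r"[^\W\d]\w*")


def is_valid_identifier(name: str) -> bool:
    """True when the whole name matches the identifier pattern."""
    return IDENTIFIER.fullmatch(name) is not None
```

Under this pattern `main_column` and `größe` are accepted, while `main-column` and `1st` are rejected. A lone `_` also matches, so the placeholder has to be distinguished elsewhere in the grammar. Note that `\w` is close to, but not the same as, the UAX #31 `XID_Start`/`XID_Continue` properties; Python's built-in `str.isidentifier()` is based on those properties and may be a useful cross-check.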
Thank you for taking a look! I really appreciate the feedback.
Yes, we want to be able to stream the data, make corrections, and recover from missing pieces.
Okay, point well taken. That makes a lot of sense. I'll see if I can lock that down.
Also a great idea. We won't be giving it to the LLM because that's too verbose, but we need it for the compiler and it would help formalize the language.
Thanks! This one is weird because
One thing I’m curious about: since Express drops many of the JSON/property keys, have you observed any impact on generation quality from losing those semantic cues? For example, keys like I can see the token-efficiency benefit, so I’m mostly wondering whether this showed up in practice, or whether the shorter format generally outweighed the loss of those semantic anchors.
The thing I thought might be a problem, but doesn't appear to be, is that LLMs aren't great at counting, so I thought that positional parameters would be the issue (getting them in the wrong order, or inserting something between them). The lack of property keys also doesn't seem to affect quality as long as the descriptions from the JSON schema in the catalog are included. For example, if the catalog item for TextField is converted to this:
• TextField(label, value?, placeholder?, variant? (static only), weight? (static only), checks? (static only))
- label: The text label for the input field.
- value: The value of the text field.
- placeholder: The placeholder text for the input field.
- variant: The type of input field to display. Must be one of: 'longText', 'number', 'shortText', 'obscured'
- weight: The relative weight of this component within a Row or Column. This is similar to the CSS 'flex-grow' property. Note: this may ONLY be set when the component is a direct descendant of a Row or Column.
Then the LLM has enough context to be able to decide what each argument means when it writes the output. We may have to play with how the catalogs are converted to prompts to minimize the tokens in the prompt, but we will need to include the entire descriptions that are supplied by the catalog developers because they can include important instructions for how to use the components.
In fact, for a while I wasn't including any of the descriptions at all (which saved a lot of input tokens!), and as long as the parameter was something well named and intuitive, the LLM seemed to be able to extrapolate. If anything was vague or unconventional, however, it started to break down.
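The conversion described in this comment, from a catalog schema to a positional signature plus per-parameter descriptions, can be sketched roughly as below. This is only an illustration under stated assumptions: the function name `positional_signature`, the schema shape, and the rule that parameters follow the schema's property order are all hypothetical and are not taken from `prompt_generator.py`.

```python
from typing import Any


def positional_signature(name: str, schema: dict[str, Any]) -> str:
    """Render a component schema as a positional signature with descriptions.

    Assumption: parameters appear in the order the schema lists its
    properties, and optional parameters are marked with a trailing '?'.
    """
    props = schema.get("properties", {})
    required = set(schema.get("required", []))
    params = [key if key in required else f"{key}?" for key in props]
    lines = [f"{name}({', '.join(params)})"]
    for key, spec in props.items():
        description = spec.get("description")
        if description:  # keep the catalog author's wording verbatim
            lines.append(f"- {key}: {description}")
    return "\n".join(lines)


# Hypothetical, trimmed-down schema using the descriptions quoted above.
text_field_schema = {
    "properties": {
        "label": {"description": "The text label for the input field."},
        "value": {"description": "The value of the text field."},
        "placeholder": {"description": "The placeholder text for the input field."},
    },
    "required": ["label"],
}

print(positional_signature("TextField", text_field_schema))
```

Dropping the `description` lines from the output would reproduce the lower-token variant mentioned above, at the cost of the cues that help with vague or unconventional parameters.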
Resolved merge conflicts in eval/a2ui_eval/scorers.py and eval/tasks.py. Refactored express evaluation solvers to be dynamic and resolve the catalog path from TaskState metadata at runtime.
FYI, I created a stack of three PRs that I split this PR into:
The last two are diffs on the first one and each other to make a stacked set of PRs. The last two are PRs on my fork, not on the main repo, until we roll the first one into the repo, and then I'll change their target to the main repo and land them too. I converted this PR back to a draft and will close it once the other PRs land.
…oducing a highly compressed, model-optimized declarative syntax (DSL) for generative user interfaces. It includes the compiler, decompiler, schema helper, and parser modules. This contains the A2UI Express compiler/decompiler portions of a2ui-project#1678, with some additional issues fixed, additional tests, and refinements. * **Compiler & Parser**: Implemented the `ExpressCompiler` and `Parser` in `a2ui.experimental.express` to parse line-oriented DSL and compile it into standard A2UI v1.0 JSON. Supports standard strings, raw strings (`r"..."`), raw multi-line strings (`r"""..."""`), and partial streaming recovery. * **Strict Enum Validation**: Added strict validation for component property enums to raise ValueError on invalid inputs instead of silently ignoring them. * **Event Context Compilation**: Simplified event context processing to avoid redundant compilation. * **Decompiler**: Implemented `ExpressDecompiler` to convert standard v1.0 JSON payloads back into compact Express DSL. * **Schema Helper & Prompt Generator**: Implemented `ExpressPromptGenerator` to compile active catalog schemas into positional signatures used by generative models. * **Examples**: Added 36 `.a2ui` layout examples and corresponding compiled `.json` targets. * **Format Checks**: Integrated `pyink` style verification for specification proposals. The feature is fully experimental and gated behind the `A2UI_EXPRESS_ENABLED=true` environment variable. It does not affect any stable production paths. * Added 44 comprehensive unit tests in `tests/express/` covering parser correctness, thread-safe compilation, raw string escaping, strict enum validation, and round-trip integrity.
…s into the Inspect-ai evaluation suite. It updates the scorers, solvers, and task configurations to support v1.0 and Express DSL targets. This contains the portions of a2ui-project#1678 which integrates A2UI Express into the evaluation suite, with some additional issues fixed, additional tests, and refinements. Changes: * **Inspect Solver**: Implemented the `express` solver strategy in `eval/a2ui_eval/strategies/express.py` to rewrite prompts for Express DSL targets and extract `<a2ui>` sentinel blocks. * **Inspect Scorer**: Updated `a2ui_scorer` to support v1.0 and compile generated Express outputs before validating them against the schema. * **Datasets**: Added the translated `v1_0_prompts.yaml` dataset containing prompt targets updated for v1.0 component requirements. * **Documentation**: Checked in `express_dsl_examples.md` describing component signatures for model context. * **Unit Tests**: Added `test_strategies.py` and updated CI test runs to verify the evaluation strategies. Impact & Risks: None. This is an evaluation-only integration and does not affect runtime SDK paths. Testing: * Added 23 integration tests covering dataset loading, scoring, and solver rewriting. All tests pass successfully.
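As a rough illustration of the sentinel extraction mentioned in this commit, the sketch below pulls the contents of `<a2ui>` blocks out of a model response. The closing `</a2ui>` tag, the function name `extract_a2ui_blocks`, and the whitespace stripping are assumptions made for the example; the actual logic in `eval/a2ui_eval/strategies/express.py` may differ.

```python
import re

# Assumption: blocks are delimited by <a2ui> ... </a2ui> and may span lines.
A2UI_BLOCK = re.compile(r"<a2ui>(.*?)</a2ui>", re.DOTALL)


def extract_a2ui_blocks(response: str) -> list[str]:
    """Return the stripped contents of every <a2ui> block in `response`."""
    return [block.strip() for block in A2UI_BLOCK.findall(response)]


sample = 'Here is the layout:\n<a2ui>\nRow([Text("Soup")])\n</a2ui>\nDone.'
print(extract_a2ui_blocks(sample))
```

The non-greedy `.*?` keeps two adjacent blocks from being merged into one match.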
…#1726) ## Summary This PR implements the A2UI Express technical specification, introducing A2UI Express — a highly compressed, model-optimized declarative syntax (DSL) for generative user interfaces. It provides a complete end-to-end Python implementation, including an ANTLR-based compiler, a decompiler, a schema-based system prompt generator, helper scripts, and comprehensive test suites. This is a refined, standalone extraction of the A2UI Express compiler/decompiler portions originally proposed in PR #1678, incorporating automated parser generation, strict validation, and extensive test coverage. ## Changes * **Build System & Code Generation**: * Added Hatch build hook in [pack_specs_hook.py](file:///Users/gspencer/code/a2ui/main/agent_sdks/python/a2ui_agent/pack_specs_hook.py) to automatically compile the ANTLR grammar [Express.g4](file:///Users/gspencer/code/a2ui/main/agent_sdks/python/a2ui_agent/src/a2ui/experimental/express/Express.g4) into Python3 source files at build-time. * The build hook handles target case-insensitive file renaming to clean snake_case (`express_lexer.py`, `express_parser.py`, `express_visitor.py`), relative import post-processing, and automatic formatting of generated code with `pyink`. * Updated [pyproject.toml](file:///Users/gspencer/code/a2ui/main/agent_sdks/python/a2ui_agent/pyproject.toml) to include `antlr4-python3-runtime` as a runtime dependency, and `antlr4-tools` in the build system requirements. * **Compiler & Parser** (`a2ui.experimental.express`): * Implemented an ANTLR-based parsing pipeline using `Express.g4` to parse line-oriented declarative layout files. * The `ExpressCompiler` compiles the AST directly into standard A2UI v1.0 JSON payloads (with dynamic positional parameter resolution and variable flattening). * Supports rich string types: standard strings, raw strings (`r"..."`), raw multiline strings (`r"""..."""`), and escaped carriage returns. 
* Integrates a partial parser mode supporting streaming recovery for incomplete layouts. * Incorporates strict enum validation for component properties, raising `ValueError` on mismatch rather than silently ignoring invalid values. * **Decompiler**: * Implemented `ExpressDecompiler` to convert standard A2UI v1.0 JSON payloads back into the highly compact, line-oriented Express DSL. * **Schema Helper & Prompt Generator**: * Implemented `CatalogSchemaHelper` to parse catalog schema definitions. * Implemented `ExpressPromptGenerator` to compile active catalog schemas into positional signatures used to prompt generative models. * **Evaluation & Testing Scripts**: * Added `run_inference.py` to evaluate the A2UI Express prompt contract by converting JSON examples to Express DSL via Gemini/Ollama/MLX models and validating the round-trip compilation. * Added `recreate_dsl_examples.py` to programmatically regenerate the dynamic markdown documentation. * **Documentation & Examples**: * Added comprehensive layout examples under `specification/proposals/express/examples/*.a2ui` (36 files) along with their corresponding compiled JSON targets. * Created [README.md](file:///Users/gspencer/code/a2ui/main/specification/proposals/express/README.md) and [a2ui_express.md](file:///Users/gspencer/code/a2ui/main/specification/proposals/express/a2ui_express.md) detailing the DSL grammar, compiler mechanics, and usage. * Created [express_dsl_examples.md](file:///Users/gspencer/code/a2ui/main/specification/proposals/express/express_dsl_examples.md) detailing the active system prompt contract and compiled weather forecast examples. ## Impact & Risks * The feature is fully experimental, contained in the `a2ui.experimental.express` namespace, and gated behind the `A2UI_EXPRESS_ENABLED=true` environment variable. * There is no impact on stable production paths or other existing SDK modules. 
* Build-time code generation introduces a dependency on `antlr4` (via `antlr4-tools` and `antlr4-python3-runtime`) during development/builds, which is automatically resolved by Hatch and standard pip/uv environments. ## Testing * Added 44 robust unit tests under `agent_sdks/python/a2ui_agent/tests/express/` including: * [test_compiler.py](file:///Users/gspencer/code/a2ui/main/agent_sdks/python/a2ui_agent/tests/express/test_compiler.py): Verifies parser correctness, token parsing, raw string handling, and carriage return unescaping. * [test_decompiler.py](file:///Users/gspencer/code/a2ui/main/agent_sdks/python/a2ui_agent/tests/express/test_decompiler.py): Validates round-trip integrity (JSON -> Express -> JSON). * [test_integration.py](file:///Users/gspencer/code/a2ui/main/agent_sdks/python/a2ui_agent/tests/express/test_integration.py): Tests the compiler against all 36 catalog layout examples. * [test_cli_tools.py](file:///Users/gspencer/code/a2ui/main/agent_sdks/python/a2ui_agent/tests/express/test_cli_tools.py): Tests script interfaces and prompt generation. * The tests can be executed via the standard Dart/Python test runners (e.g. `uv run pytest`).
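The strict enum validation described in this commit (raising `ValueError` rather than silently ignoring an invalid value) might look roughly like the following. The helper `validate_enum` is hypothetical and is not the compiler's actual API; the allowed values are the `variant` options quoted earlier in this thread.

```python
from typing import Iterable


def validate_enum(prop: str, value: str, allowed: Iterable[str]) -> str:
    """Return `value` if it is an allowed enum member, else raise ValueError.

    Hypothetical helper for illustration only.
    """
    choices = list(allowed)
    if value not in choices:
        raise ValueError(
            f"Invalid value {value!r} for {prop!r}; expected one of {choices}"
        )
    return value


# Allowed TextField variants as listed in the catalog description above.
VARIANTS = ["longText", "number", "shortText", "obscured"]

print(validate_enum("variant", "number", VARIANTS))
try:
    validate_enum("variant", "multiline", VARIANTS)
except ValueError as err:
    print(f"rejected: {err}")
```

Failing loudly here surfaces a model's mistake at compile time instead of producing a payload that quietly falls back to a default.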
Summary
This pull request introduces A2UI Express, an experimental, compact domain-specific language (DSL) that allows agents to express UI layouts with minimal token usage. It provides a complete compilation/decompilation pipeline, integrates it into the evaluations strategy suite, and documents the specification draft under proposals.
Evaluation Results
Below is a summary of the evaluations benchmark run (47 samples) comparing the layout generation strategies using the lightweight `google/gemini-3.1-flash-lite` model:
(Table: strategies `direct` (Raw JSON) and `express` (A2UI Express DSL), compared on `a2ui_scorer` Accuracy (Syntax) and `measured_model_graded_qa` Accuracy (Semantics); cell values were not preserved.)
By compiling into inline components and inline data models, the Express strategy matches or exceeds the direct strategy's syntactic validity while reducing generation latencies by 70% (over 3.3x faster) and saving 56% of total generation tokens (and 72% of output tokens).
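The relationship between the quoted percentage reductions and the "x faster" figure is simple arithmetic: a fractional reduction r corresponds to a factor of 1 / (1 - r). A small sketch, with the hypothetical helper `speedup_from_reduction`, using only the percentages stated above:

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional reduction (0.70 for 70%) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


# Percentages taken from the summary above.
print(f"latency: {speedup_from_reduction(0.70):.2f}x faster")
print(f"output tokens: {speedup_from_reduction(0.72):.2f}x fewer")
print(f"total tokens: {speedup_from_reduction(0.56):.2f}x fewer")
```

A 70% latency reduction works out to roughly 3.33x, consistent with the "over 3.3x faster" claim.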
Changes
Python Agent SDK
- `agent_sdks/python/a2ui_agent/src/a2ui/express/`: Added the core package including:
  - `compiler.py`: Compiles flat Express DSL statements into standard A2UI wire JSON. Includes robust exception handling and positional argument validation.
    - `$path` token resolving to relative root JSONPointer `{"path": ""}`.
    - … `?` and `([?required, ?email])`).
    - … `Row([Text("Soup")])` -> `_inline_1`).
    - … `#` and JS-style `//` comments outside string literals.
  - `decompiler.py`: Translates standard wire JSON layouts back into the flat DSL representation, with safety guards on missing properties.
  - `prompt_generator.py`: Generates LLM system instruction prompt contracts containing component signatures, enums, static property indicators, and nested schema requirements dynamically.
  - `schema_helper.py`: Crawls catalog JSON schemas to resolve positional parameter bounds and properties, with type checks to skip boolean schemas inside `allOf` loops.
- `agent_sdks/python/a2ui_core/`: Improved core validation routines to make them schema-driven:
  - `node_graph.py` & `integrity_checker.py`: Decoupled structural and static checks from hardcoded catalog definitions, resolving constraints dynamically from catalog schemas.
- `A2UI_EXPRESS_ENABLED`: If not set to `true`, importing `a2ui.express` raises an `ImportError`.
- `A2UI_VERSION_1_0`: Gates the new dynamic JSON schema validator for v1.0. Automatically enabled if `A2UI_EXPRESS_ENABLED` is true.
- … `tests/express/test_express.py` covering DSL tokenization, parsing, expression nesting, map variable inlining, comments, and validator gating. Added `tests/express/test_cli_tools.py` to test all 4 CLI wrapper scripts under `specification/proposals/express/`, achieving 90% test coverage for the CLI utility suite.
Specification & Proposals
- `specification/proposals/express/`: Relocated and formalized the post-1.0 draft specification (`a2ui_express.md`, `evolve_express.md`, and sample configs) to prevent polluting the ratified `v1_0` baseline.
- … `.a2ui` evaluation layout examples to `specification/proposals/express/examples/`.
Evaluations Suite
- `eval/a2ui_eval/strategies/express.py`: Added the `express` evaluation solver that runs and merges layout accuracy scores. Express compiler results generate standard A2UI v1.0 JSON payloads directly.
- `eval/tasks.py` / `main.py`: Shifted the core baseline evaluation task to run against A2UI v0.9.1 (renamed to `a2ui_v0_9_1_eval`), while keeping the `express` strategy target isolated to A2UI v1.0.
- … `eval/datasets/v1_0_prompts.yaml` that adapts prompt texts and target descriptions to use v1.0 terminology (inline components, `longText` instead of `multiline`, and omission of `returnType` properties). This aligns target outputs with strategy versions for clean, accurate LLM grading.
Impact & Risks
- … `A2UI_EXPRESS_ENABLED` environment switch.
- … `direct`, `subagent_tool`) are preserved as defaults.
Testing
- … `uv run pytest` inside `agent_sdks/python/a2ui_agent/` folder.
- … `uv run main.py --strategies express` inside the `eval/` folder.
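The `A2UI_EXPRESS_ENABLED` gate described under Changes (an `ImportError` unless the variable is set to `true`) could be implemented along these lines. This is a sketch only: the function name `require_express_enabled`, the case-insensitive comparison, and the error text are assumptions, not the package's actual code.

```python
import os
from typing import Mapping


def require_express_enabled(environ: Mapping[str, str] = os.environ) -> None:
    """Raise ImportError unless A2UI_EXPRESS_ENABLED is set to 'true'.

    Assumption: the comparison ignores case and surrounding whitespace.
    """
    flag = environ.get("A2UI_EXPRESS_ENABLED", "").strip().lower()
    if flag != "true":
        raise ImportError(
            "a2ui.express is experimental; "
            "set A2UI_EXPRESS_ENABLED=true to enable it."
        )


# A package could call this at the top of its __init__.py; here the
# environment is passed explicitly so the behavior is easy to see.
require_express_enabled({"A2UI_EXPRESS_ENABLED": "true"})
try:
    require_express_enabled({})
except ImportError as err:
    print(f"blocked: {err}")
```

Taking the environment as a parameter keeps the check testable without mutating `os.environ`.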