StructEval is a framework for evaluating language models on structured outputs; it supports inference, rendering, and evaluation of the generated code.

```bash
# Create and activate the conda environment
conda create -n structeval python=3.12
conda activate structeval
# Install all required packages (required)
pip install -r requirements.txt
# Separately install LLM-Engines (required for inference)
pip install git+https://github.com/jdf-prog/LLM-Engines.git
# Install playwright browsers (required for rendering)
playwright install
# Optionally, install the package in development mode
pip install -e .
```

The following system packages will be installed automatically through conda:

- `ghostscript` and `poppler`: required for PDF processing
- `nodejs`: required for Playwright
- `graphviz`: required for visualization
- `imagemagick`: required for image processing
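
To confirm that these binaries are actually available on your system, a quick check such as the following can help; the script is illustrative and not part of the StructEval package.

```python
# Sanity check that the required system binaries are on PATH.
# Illustrative only; not shipped with StructEval.
import shutil

# Note: ImageMagick 7 exposes `magick` instead of `convert`.
REQUIRED_BINARIES = ["gs", "pdftoppm", "node", "dot", "convert"]

for name in REQUIRED_BINARIES:
    path = shutil.which(name)
    print(f"{name:10s} -> {path if path else 'NOT FOUND'}")
```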
If you encounter any issues with system dependencies, you can install them manually using your system's package manager:

```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install ghostscript poppler-utils nodejs graphviz imagemagick
# CentOS/RHEL
sudo yum install ghostscript poppler nodejs graphviz ImageMagick
# macOS
brew install ghostscript poppler node graphviz imagemagick
```

StructEval provides a command-line interface for running inference, rendering, and evaluation.

```bash
# Run inference
python -m structeval.cli inference \
--llm_model_name "model_name" \
--llm_engine "engine_name" \
--input_path "path/to/input.json" \
--output_path "path/to/output.json"
# Render outputs
python -m structeval.cli render \
--input_path "path/to/inference_output.json" \
--img_output_path "path/to/rendered_images" \
--non_renderable_output_dir "path/to/non_renderable_files"
# Run evaluation
python -m structeval.cli evaluate \
--vlm_model_name "model_name" \
--vlm_engine "engine_name" \
--input_path "path/to/inference_output.json" \
--output_path "path/to/evaluation_output.json" \
--img_path "path/to/rendered_images" \
--non_renderable_output_dir "path/to/non_renderable_files"
```

Inference arguments:

- `--llm_model_name`: Name of the language model (e.g., "meta-llama/Llama-3.1-8B-Instruct", "gpt-4.1-mini")
- `--llm_engine`: Engine for running inference (e.g., "vllm", "openai")
- `--input_path`: Path to the input dataset JSON file
- `--output_path`: Path to save inference results

Render arguments:

- `--input_path`: Path to the inference output JSON
- `--img_output_path`: Directory to save rendered images
- `--non_renderable_output_dir`: Directory to save non-renderable outputs

Evaluation arguments:

- `--vlm_model_name`: Name of the vision-language model used for evaluation (e.g., "gpt-4.1-mini")
- `--vlm_engine`: Engine for evaluation (e.g., "openai")
- `--input_path`: Path to the inference output JSON
- `--output_path`: Path to save evaluation results
- `--img_path`: Path to the directory containing rendered images
- `--non_renderable_output_dir`: Directory containing non-renderable outputs
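
Putting the three stages together: the following sketch chains them with `subprocess`, using only the flags documented above. The model names and file paths are placeholders chosen for illustration, not defaults shipped with the repository.

```python
# Minimal driver that runs inference, rendering, and evaluation in sequence.
# Model names and paths below are illustrative placeholders.
import subprocess


def run_stage(stage, **kwargs):
    """Invoke one StructEval CLI stage with the given flags."""
    cmd = ["python", "-m", "structeval.cli", stage]
    for flag, value in kwargs.items():
        cmd += [f"--{flag}", str(value)]
    subprocess.run(cmd, check=True)


run_stage(
    "inference",
    llm_model_name="gpt-4.1-mini",
    llm_engine="openai",
    input_path="data/input.json",
    output_path="results/inference_output.json",
)

run_stage(
    "render",
    input_path="results/inference_output.json",
    img_output_path="results/rendered_images",
    non_renderable_output_dir="results/non_renderable_files",
)

run_stage(
    "evaluate",
    vlm_model_name="gpt-4.1-mini",
    vlm_engine="openai",
    input_path="results/inference_output.json",
    output_path="results/evaluation_output.json",
    img_path="results/rendered_images",
    non_renderable_output_dir="results/non_renderable_files",
)
```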
The repository includes helper scripts for running full experiments:
Run inference across multiple models:
```bash
python -m structeval.run_inference
```

Render outputs from inference results:

```bash
python -m structeval.run_render
```

Evaluate rendered outputs:

```bash
python -m structeval.run_evaluation
```

These scripts can be configured by editing the model lists and file paths within them.
The input JSON should be an array of task objects with the following structure:

```json
[
{
"task_id": "000500",
"query": "Please output JSON code:\n\nTask:\n...",
"feature_requirements": "",
"task_name": "Text to JSON",
"input_type": "Text",
"output_type": "JSON",
"query_example": "",
"VQA": [],
"raw_output_metric": [
"novel.title",
"novel.author.name",
"novel.characters[0].name"
],
"rendering": false
}
]
```

- `task_id`: Unique identifier for the task
- `query`: The prompt sent to the model
- `task_name`: Name of the task (e.g., "Text to JSON", "Text to Angular")
- `input_type`: Type of input (e.g., "Text")
- `output_type`: Expected output format (e.g., "JSON", "Angular")
- `VQA`: Array of visual question-answer pairs for evaluating renderable outputs
- `raw_output_metric`: Keys or elements to check in the output
- `rendering`: Boolean indicating whether the output should be rendered visually
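
An input file matching this structure can also be generated programmatically. The sketch below is illustrative; the task values and output path are placeholders, not part of the benchmark data.

```python
# Build a minimal StructEval input file from the fields described above.
# All concrete values here are illustrative placeholders.
import json

task = {
    "task_id": "000500",
    "query": "Please output JSON code:\n\nTask:\n<task description here>",
    "feature_requirements": "",
    "task_name": "Text to JSON",
    "input_type": "Text",
    "output_type": "JSON",
    "query_example": "",
    "VQA": [],
    "raw_output_metric": [
        "novel.title",
        "novel.author.name",
        "novel.characters[0].name",
    ],
    "rendering": False,
}

with open("data/input.json", "w", encoding="utf-8") as f:
    json.dump([task], f, indent=2)
```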
The inference process adds a generation field to each task in the input:

```json
[
{
"task_id": "000500",
"query": "Please output JSON code:\n\nTask:\n...",
"feature_requirements": "",
"task_name": "Text to JSON",
"input_type": "Text",
"output_type": "JSON",
"query_example": "",
"VQA": [],
"raw_output_metric": [...],
"rendering": false,
"generation": "```json\n{\n \"novel\": {\n \"title\": \"The Obsidian Labyrinth\",\n \"author\": {\n \"name\": \"Anya Petrova\",\n \"birth_year\": 1978\n },\n ...\n }\n}\n```"
}
]
```
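
As the example shows, the `generation` field typically wraps the model's answer in a Markdown code fence. If you post-process these files yourself, a small helper can strip the fence before parsing; the function below is an illustrative sketch, not an API exposed by StructEval, and the file path is the placeholder used in the earlier examples.

```python
# Strip an optional Markdown code fence (e.g. ```json ... ```) from a generation string.
# Illustrative sketch; StructEval's own extraction logic may differ.
import json
import re


def extract_payload(generation: str) -> str:
    match = re.search(r"```[a-zA-Z]*\n(.*?)```", generation, re.DOTALL)
    return match.group(1).strip() if match else generation.strip()


with open("results/inference_output.json", encoding="utf-8") as f:
    tasks = json.load(f)

payload = extract_payload(tasks[0]["generation"])
data = json.loads(payload)  # valid when output_type is JSON
print(sorted(data.keys()))
```

The evaluation result contains additional scoring fields:

```json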
[
{
"task_id": "000500",
"query": "Please output JSON code:\n\nTask:\n...",
"feature_requirements": "",
"task_name": "Text to JSON",
"input_type": "Text",
"output_type": "JSON",
"VQA": [],
"raw_output_metric": [...],
"rendering": false,
"generation": "...",
"output_file": "experiment_results/model-name/non_renderable_files/000500.json",
"render_score": 1,
"VQA_score": null,
"key_validation_score": 1.0,
"final_eval_score": 1.0
}
]
```

- `output_file`: Path to the rendered output or extracted JSON file
- `render_score`: Score indicating whether the output was rendered successfully (0 or 1)
- `VQA_score`: Score from visual question-answering evaluation (for renderable outputs)
- `key_validation_score`: Score from validating expected keys in JSON output (for non-renderable outputs)
- `raw_output_eval`: Array of boolean values indicating whether each raw output metric was satisfied
- `raw_output_score`: Score from the raw output evaluation
- `final_eval_score`: Overall evaluation score between 0 and 1
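
The per-task scores are easy to aggregate once evaluation has finished. The sketch below averages `final_eval_score` overall and per `task_name`, assuming the placeholder output path used in the earlier examples.

```python
# Aggregate evaluation results: mean final_eval_score overall and per task_name.
# The file path is a placeholder matching the earlier examples.
import json
from collections import defaultdict

with open("results/evaluation_output.json", encoding="utf-8") as f:
    results = json.load(f)

overall = sum(r["final_eval_score"] for r in results) / len(results)
print(f"Overall score: {overall:.3f}")

by_task = defaultdict(list)
for r in results:
    by_task[r["task_name"]].append(r["final_eval_score"])

for task_name, scores in sorted(by_task.items()):
    print(f"{task_name:20s} {sum(scores) / len(scores):.3f}  (n={len(scores)})")
```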

Please cite us with the following BibTeX:

```bibtex
@misc{yang2025structeval,
title={StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs},
author={Jialin Yang and Dongfu Jiang and Lipeng He and Sherman Siu and Yuxuan Zhang and Disen Liao and Zhuofeng Li and Huaye Zeng and Yiming Jia and Haozhe Wang and Benjamin Schneider and Chi Ruan and Wentao Ma and Zhiheng Lyu and Yifei Wang and Yi Lu and Quy Duc Do and Ziyan Jiang and Ping Nie and Wenhu Chen},
year={2025},
eprint={2505.20139},
archivePrefix={arXiv},
primaryClass={cs.SE},
doi={10.48550/arXiv.2505.20139}
}
```