AI-powered academic diagrams that actually communicate ideas — not just labeled boxes.
For researchers, students, and anyone who needs publication-quality figures from text descriptions.
10x faster | 20x cheaper | 95.8/100 avg quality (self-evaluated via vision critic)
mkdir -p ~/.claude/skills/paperbanana
curl -sL https://github.com/stuinfla/paperbanana/main/docs/SKILL.md > ~/.claude/skills/paperbanana/SKILL.md

Type /paperbanana in Claude Code and start describing diagrams. That's it.
Quick setup: curl -sL https://github.com/stuinfla/paperbanana/main/install.sh | bash
Then add to your project's .mcp.json:
{
"mcpServers": {
"paperbanana": {
"command": "python3",
"args": ["-m", "mcp_server.server"],
"cwd": "~/paperbanana",
"env": { "GOOGLE_API_KEY": "your-key-here" }
}
}
}

Get a free API key at aistudio.google.com/apikey, then ask Claude: "Generate a diagram showing how neural networks learn"
git clone https://github.com/stuinfla/paperbanana && cd paperbanana
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
export GOOGLE_API_KEY="your-key" # Free at https://aistudio.google.com/apikey
.venv/bin/python cli_generate.py --content "Your concept" --caption "Figure 1" --output diagram.png

Note: These are this fork's Skill and MCP (SVG pipeline + visual storytelling), different from the community `pip install paperbanana` package (see Community Supports).
Here are 3 examples generated by this fork's SVG pipeline (95+ each):
| Pi: Data Flow Pipeline | RuVector: GNN Learning Loop | Ruflo: Swarm Topologies |
|---|---|---|
Original research by Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, Jinsung Yoon (Peking University + Google Cloud AI Research). Originally open-sourced as PaperVizAgent.
Enhanced by Stuart Kerr — SVG pipeline, visual storytelling, vision critic, 10x speed, 20x cost reduction.
| | Original | This Fork |
|---|---|---|
| Rendering | Raster image generation | SVG code + Cairo render (100% text fidelity) |
| Speed | 2-5 min per diagram | ~30s in SVG mode |
| Cost | $0.50-2.00 per diagram | ~$0.05 in SVG mode |
| Quality | ~62-72/100 avg | 95.8/100 avg (self-evaluated) |
| Design | Labeled boxes and arrows | Visual-first: icons, shapes, spatial layout |
| Self-correction | Text-based critic | Vision critic sees the rendered PNG, sends spatial fixes |
| Output | Raster PNG only | Editable SVG + PNG |
| Integration | Streamlit only | Skill, MCP, CLI, Streamlit, Python API |
The core innovation: instead of drawing labeled boxes, the Planner first asks "What is this concept LIKE?" — finding a visual metaphor that makes the diagram click in seconds.
For CLI, Streamlit, Python API, or MCP server:
git clone https://github.com/stuinfla/paperbanana && cd paperbanana
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
export GOOGLE_API_KEY="your-key-here"  # Free at https://aistudio.google.com/apikey

# Generate a diagram
.venv/bin/python cli_generate.py \
--content "Description of your concept" \
--caption "Figure 1: What this shows" \
--output diagram.png
# From a file
.venv/bin/python cli_generate.py \
--content-file method_section.md \
--caption "Figure 2: System architecture." \
--output diagram.png
# Multiple candidates (picks the best)
.venv/bin/python cli_generate.py \
--content-file method_section.md \
--caption "Figure 1" \
--output diagram.png \
--candidates 5

Streamlit web UI:

.venv/bin/streamlit run demo.py

Python API
import asyncio
from utils.paperviz_processor import PaperVizProcessor
from utils import config
from agents.planner_agent import PlannerAgent
from agents.visualizer_agent import VisualizerAgent
from agents.stylist_agent import StylistAgent
from agents.critic_agent import CriticAgent
from agents.retriever_agent import RetrieverAgent
from agents.vanilla_agent import VanillaAgent
from agents.polish_agent import PolishAgent
exp_config = config.ExpConfig(
dataset_name="Demo", task_name="diagram", split_name="demo",
exp_mode="demo_full", retrieval_setting="auto", max_critic_rounds=3,
)
processor = PaperVizProcessor(
exp_config=exp_config,
planner_agent=PlannerAgent(exp_config=exp_config),
visualizer_agent=VisualizerAgent(exp_config=exp_config),
stylist_agent=StylistAgent(exp_config=exp_config),
critic_agent=CriticAgent(exp_config=exp_config),
retriever_agent=RetrieverAgent(exp_config=exp_config),
vanilla_agent=VanillaAgent(exp_config=exp_config),
polish_agent=PolishAgent(exp_config=exp_config),
)
input_data = {
"filename": "my_diagram",
"caption": "Figure 1: System architecture.",
"content": "Your methodology text here...",
"visual_intent": "Figure 1: System architecture.",
}
async def generate():
async for result in processor.process_queries_batch(
[input_data], max_concurrent=1, do_eval=False
):
print("Generated:", result.keys())
asyncio.run(generate())

CLI Reference
| Flag | Values | Default | Description |
|---|---|---|---|
| `--content` | text | required* | Inline content to visualize |
| `--content-file` | path | required* | File containing content |
| `--caption` | text | required | Figure caption / visual intent |
| `--output` | path | `output.png` | Output image path |
| `--task` | `diagram`, `plot` | `diagram` | Type of visualization |
| `--mode` | `demo_full`, `demo_planner_critic`, `vanilla` | `demo_full` | Pipeline mode |
| `--retrieval` | `auto`, `manual`, `random`, `none` | `none` | Reference retrieval strategy |
| `--critic-rounds` | 1-5 | 3 | Max refinement iterations |
| `--candidates` | 1-20 | 1 | Parallel candidates to generate |
| `--aspect-ratio` | `16:9`, `21:9`, `3:2` | `16:9` | Output aspect ratio |
| `--quiet` | flag | false | Suppress progress output |
| `--model` | model name | config | Override reasoning model |
| `--image-model` | model name | config | Override image generation model |
*One of --content or --content-file is required.
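As a sketch only, here is what a typical invocation combining several of the flags above might look like (file names and caption are illustrative, assuming the Quick Start setup):

```shell
# Illustrative combination of flags from the table above.
.venv/bin/python cli_generate.py \
  --content-file method_section.md \
  --caption "Figure 3: Training loop." \
  --output figure3.png \
  --mode demo_planner_critic \
  --critic-rounds 2 \
  --candidates 3 \
  --aspect-ratio 3:2
```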
Models and Cost
| Model | Role | Quality | Speed | Cost |
|---|---|---|---|---|
| `gemini-3.1-pro-preview` | Reasoning | Best | ~15s/call | Higher |
| `gemini-3-pro-image-preview` | Image Gen | Best | ~60s/call | Higher |
| `gemini-2.5-flash` | Reasoning | Good | ~5s/call | Lower |
| `gemini-2.5-flash-image` | Image Gen | Good | ~30s/call | Lower |
Switch models: --model gemini-2.5-flash --image-model gemini-2.5-flash-image
Or set permanently in configs/model_config.yaml:
defaults:
  model_name: "gemini-2.5-flash"
  image_model_name: "gemini-2.5-flash-image"

| Quality Tier | Cost/image | Time |
|---|---|---|
| SVG pipeline (best) | ~$0.05 | ~30s |
| Full raster pipeline | ~$0.10 | ~2 min |
| Draft (vanilla) | ~$0.02 | ~90s |
Optional: download PaperBananaBench into data/PaperBananaBench/ for +15 quality points via reference retrieval.
Before drawing anything, the Planner first searches for a visual metaphor: what is this concept *like*?
A container format becomes a shipping crate with compartments. A self-learning database becomes a living library. The reviewer "gets it" in seconds.
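For intuition, the metaphor step can be pictured as a mapping from an abstract concept to a concrete visual analogue. This is an illustrative sketch only; the names below are hypothetical and not the fork's actual Planner API:

```python
# Hypothetical sketch of metaphor discovery; not the real Planner implementation.
METAPHORS = {
    "container format": "shipping crate with compartments",
    "self-learning database": "living library",
}

def find_metaphor(concept: str) -> str:
    """Map a concept to a concrete visual analogue; fall back to generic boxes."""
    return METAPHORS.get(concept, "labeled boxes and arrows")

print(find_metaphor("container format"))  # shipping crate with compartments
```

In the real pipeline this mapping is produced by the reasoning model, not a lookup table, but the contract is the same: every concept leaves the Planner with a concrete visual anchor.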
The highest-quality mode. LLM writes SVG code directly, Cairo renders to PNG, multimodal Gemini evaluates the rendered image, and spatial fixes are applied automatically.
- SVG Generation -- LLM writes SVG with labels and descriptions on every element
- Cairo Rendering -- 100% text fidelity (text placed by renderer, not predicted by a neural net)
- Vision Critique -- evaluates for overlap, clipping, layout, missing information
- Self-Correction -- fixes applied, re-rendered, re-evaluated until 95+/100
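The four steps above form a loop, which can be sketched as follows (all callables are stand-ins for the fork's actual agents, not its real API):

```python
# Minimal sketch of the render-critique-fix loop; render/critique/fix are stand-ins.
def refine_svg(svg, render, critique, fix, target=95, max_rounds=3):
    """Re-render and re-evaluate until the vision critic's score reaches target."""
    score = 0
    for _ in range(max_rounds):
        png = render(svg)              # Cairo: SVG source -> rendered PNG
        score, issues = critique(png)  # multimodal model scores the raster
        if score >= target:
            break
        svg = fix(svg, issues)         # apply spatial fixes to the SVG source
    return svg, score
```

Because the critic looks at the rendered PNG rather than the SVG source, it catches purely visual defects (overlap, clipping) that a text-only critic cannot see.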
| With Visual Metaphor (95/100) | Standard Pipeline (65/100) |
|---|---|
| ![]() | ![]() |
WiFi sensing is invisible. The metaphor -- waves passing through a person with a pose overlay -- makes the invisible visible.
| With Visual Metaphor (92/100) | Standard Pipeline (76/100) |
|---|---|
| ![]() | ![]() |
The "Knowledge City" metaphor turns abstract ideas into something tangible.
More examples
| With Visual Metaphor (93/100) | Standard Pipeline (68/100) |
|---|---|
| ![]() | ![]() |
| With Visual Metaphor (94/100) | Standard Pipeline (78/100) |
|---|---|
| ![]() | ![]() |
| Scenario | Enhanced | Baseline | Gain |
|---|---|---|---|
| Application ecosystem | 93 | 68 | +25 |
| Product overview | 94 | 78 | +16 |
| Technical architecture | 95 | 65 | +30 |
| Abstract concepts | 92 | 76 | +16 |
| Average | 93.5 | 71.75 | +21.75 |
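The Average row can be checked directly from the four scenario scores:

```python
# Verify the Average row of the benchmark table above.
enhanced = [93, 94, 95, 92]  # ecosystem, product, architecture, abstract
baseline = [68, 78, 65, 76]

avg_enhanced = sum(enhanced) / len(enhanced)
avg_baseline = sum(baseline) / len(baseline)

print(avg_enhanced, avg_baseline, avg_enhanced - avg_baseline)  # 93.5 71.75 21.75
```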
The original 5-agent pipeline (all enhanced in this fork) plus the new SVG Visualizer:
| Agent | Enhancement |
|---|---|
| Planner | Mandatory visual metaphor discovery before describing elements |
| Stylist | Preserves metaphors (never flattens into generic boxes) |
| Visualizer | Multi-candidate parallel generation + tag stripping |
| Critic | 7 mandatory visual excellence checks, strict pass threshold |
| SVG Visualizer (new) | LLM writes SVG + Cairo render + multimodal vision critic loop |
Quality journey across iterations
Each iteration built on the one before. The storytelling step produced the largest single improvement because it changes the strategy rather than just the execution.
Advanced: Batch Evaluation and Visualization
# Batch evaluation
python main.py \
--dataset_name "PaperBananaBench" \
--task_name "diagram" \
--split_name "test" \
--exp_mode "dev_full" \
--retrieval_setting "auto"
# Pipeline evolution viewer
streamlit run visualize/show_pipeline_evolution.py
# Evaluation results
streamlit run visualize/show_referenced_eval.py

Modes: vanilla, dev_planner, dev_planner_stylist, dev_planner_critic, dev_full, demo_planner_critic, demo_full
ASCII Version (for AI/accessibility)
agents/
planner_agent.py # Visual metaphor discovery + description
stylist_agent.py # Metaphor-preserving style refinement
visualizer_agent.py # Multi-candidate image generation
svg_visualizer_agent.py # SVG code gen + Cairo render + vision critic
critic_agent.py # 7-check visual excellence scoring
retriever_agent.py # Reference example retrieval
vanilla_agent.py # Direct generation (baseline)
polish_agent.py # Post-processing refinement
mcp_server/
server.py # MCP server (4 tools) for AI assistants
cli_generate.py # Headless CLI for single diagram generation
demo.py # Streamlit web UI
main.py # Batch evaluation runner
configs/ # Model config templates
data/PaperBananaBench/ # Reference dataset (download separately)
docs/ # Comparison images, showcase gallery
style_guides/ # NeurIPS aesthetic guidelines
utils/ # Pipeline orchestration, config
visualize/ # Pipeline visualization tools
Note: The community projects below are independent implementations of the original PaperBanana paper -- not related to the Skill or MCP in this fork. If you installed from Quick Start above, you're using this fork's SVG pipeline.
- https://github.com/llmsresearch/paperbanana -- pip-installable package with its own MCP server (different from this fork's MCP)
- https://github.com/efradeca/freepaperbanana
Related work in automated academic illustration:
- https://github.com/ResearAI/AutoFigure-Edit
- https://github.com/OpenDCAI/Paper2Any
- https://github.com/BIT-DataLab/Edit-Banana
We warmly welcome community contributions to make PaperBanana even better!
Apache-2.0
@article{zhu2026paperbanana,
title={PaperBanana: Automating Academic Illustration for AI Scientists},
author={Zhu, Dawei and Meng, Rui and Song, Yale and Wei, Xiyu and Li, Sujian and Pfister, Tomas and Yoon, Jinsung},
journal={arXiv preprint arXiv:2601.23265},
year={2026}
}

This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.
Our goal is simply to benefit the community, so currently we have no plans to use it for commercial purposes. The core methodology was developed during my internship at Google, and patents have been filed for these specific workflows by Google. While this doesn't impact open-source research efforts, it restricts third-party commercial applications using similar logic.