Skip to content

stuinfla/paperbanana

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

PaperBanana 🍌

AI-powered academic diagrams that actually communicate ideas — not just labeled boxes.

For researchers, students, and anyone who needs publication-quality figures from text descriptions.

10x faster | 20x cheaper | 95.8/100 avg quality (self-evaluated via vision critic)

Paper page on HF Dataset on HF


Quick Start - 3 ways to get started

Option A: Claude Code Skill (Recommended — no install needed)

mkdir -p ~/.claude/skills/paperbanana
curl -sL https://github.com/stuinfla/paperbanana/main/docs/SKILL.md > ~/.claude/skills/paperbanana/SKILL.md

Type /paperbanana in Claude Code and start describing diagrams. That's it.

Option B: MCP Server (for Claude Code / AI assistants)

Quick setup: curl -sL https://github.com/stuinfla/paperbanana/main/install.sh | bash

Then add to your project's .mcp.json:

{
  "mcpServers": {
    "paperbanana": {
      "command": "python3",
      "args": ["-m", "mcp_server.server"],
      "cwd": "~/paperbanana",
      "env": { "GOOGLE_API_KEY": "your-key-here" }
    }
  }
}

Get a free API key at aistudio.google.com/apikey, then ask Claude: "Generate a diagram showing how neural networks learn"

Option C: Command Line

git clone https://github.com/stuinfla/paperbanana && cd paperbanana
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
export GOOGLE_API_KEY="your-key"  # Free at https://aistudio.google.com/apikey
.venv/bin/python cli_generate.py --content "Your concept" --caption "Figure 1" --output diagram.png

Note: These are this fork's Skill and MCP (SVG pipeline + visual storytelling). Different from the community pip install paperbanana package — see Community Supports.


Here are 3 examples generated by this fork's SVG pipeline (95+ each):

Pi: Data Flow Pipeline RuVector: GNN Learning Loop Ruflo: Swarm Topologies

What's Different About This Fork

Original research by Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, Jinsung Yoon (Peking University + Google Cloud AI Research). Originally open-sourced as PaperVizAgent.

Enhanced by Stuart Kerr — SVG pipeline, visual storytelling, vision critic, 10x speed, 20x cost reduction.

Original This Fork
Rendering Raster image generation SVG code + Cairo render (100% text fidelity)
Speed 2-5 min per diagram ~30s in SVG mode
Cost $0.50-2.00 per diagram ~$0.05 in SVG mode
Quality ~62-72/100 avg 95.8/100 avg (self-evaluated)
Design Labeled boxes and arrows Visual-first: icons, shapes, spatial layout
Self-correction Text-based critic Vision critic sees the rendered PNG, sends spatial fixes
Output Raster PNG only Editable SVG + PNG
Integration Streamlit only Skill, MCP, CLI, Streamlit, Python API

The core innovation: instead of drawing labeled boxes, the Planner first asks "What is this concept LIKE?" — finding a visual metaphor that makes the diagram click in seconds.


Full Install

For CLI, Streamlit, Python API, or MCP server:

git clone https://github.com/stuinfla/paperbanana && cd paperbanana
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
export GOOGLE_API_KEY="your-key-here"  # Free at https://aistudio.google.com/apikey

Command Line

# Generate a diagram
.venv/bin/python cli_generate.py \
  --content "Description of your concept" \
  --caption "Figure 1: What this shows" \
  --output diagram.png

# From a file
.venv/bin/python cli_generate.py \
  --content-file method_section.md \
  --caption "Figure 2: System architecture." \
  --output diagram.png

# Multiple candidates (picks the best)
.venv/bin/python cli_generate.py \
  --content-file method_section.md \
  --caption "Figure 1" \
  --output diagram.png \
  --candidates 5

Web UI (Streamlit)

.venv/bin/streamlit run demo.py
Python API
import asyncio
from utils.paperviz_processor import PaperVizProcessor
from utils import config
from agents.planner_agent import PlannerAgent
from agents.visualizer_agent import VisualizerAgent
from agents.stylist_agent import StylistAgent
from agents.critic_agent import CriticAgent
from agents.retriever_agent import RetrieverAgent
from agents.vanilla_agent import VanillaAgent
from agents.polish_agent import PolishAgent

exp_config = config.ExpConfig(
    dataset_name="Demo", task_name="diagram", split_name="demo",
    exp_mode="demo_full", retrieval_setting="auto", max_critic_rounds=3,
)

processor = PaperVizProcessor(
    exp_config=exp_config,
    planner_agent=PlannerAgent(exp_config=exp_config),
    visualizer_agent=VisualizerAgent(exp_config=exp_config),
    stylist_agent=StylistAgent(exp_config=exp_config),
    critic_agent=CriticAgent(exp_config=exp_config),
    retriever_agent=RetrieverAgent(exp_config=exp_config),
    vanilla_agent=VanillaAgent(exp_config=exp_config),
    polish_agent=PolishAgent(exp_config=exp_config),
)

input_data = {
    "filename": "my_diagram",
    "caption": "Figure 1: System architecture.",
    "content": "Your methodology text here...",
    "visual_intent": "Figure 1: System architecture.",
}

async def generate():
    async for result in processor.process_queries_batch(
        [input_data], max_concurrent=1, do_eval=False
    ):
        print("Generated:", result.keys())

asyncio.run(generate())
CLI Reference
Flag Values Default Description
--content text required* Inline content to visualize
--content-file path required* File containing content
--caption text required Figure caption / visual intent
--output path output.png Output image path
--task diagram, plot diagram Type of visualization
--mode demo_full, demo_planner_critic, vanilla demo_full Pipeline mode
--retrieval auto, manual, random, none none Reference retrieval strategy
--critic-rounds 1-5 3 Max refinement iterations
--candidates 1-20 1 Parallel candidates to generate
--aspect-ratio 16:9, 21:9, 3:2 16:9 Output aspect ratio
--quiet flag false Suppress progress output
--model model name config Override reasoning model
--image-model model name config Override image generation model

*One of --content or --content-file is required.

Models and Cost
Model Role Quality Speed Cost
gemini-3.1-pro-preview Reasoning Best ~15s/call Higher
gemini-3-pro-image-preview Image Gen Best ~60s/call Higher
gemini-2.5-flash Reasoning Good ~5s/call Lower
gemini-2.5-flash-image Image Gen Good ~30s/call Lower

Switch models: --model gemini-2.5-flash --image-model gemini-2.5-flash-image

Or set permanently in configs/model_config.yaml:

defaults:
  model_name: "gemini-2.5-flash"
  image_model_name: "gemini-2.5-flash-image"
Quality Tier Cost/image Time
SVG pipeline (best) ~$0.05 ~30s
Full raster pipeline ~$0.10 ~2 min
Draft (vanilla) ~$0.02 ~90s

Optional: download PaperBananaBench into data/PaperBananaBench/ for +15 quality points via reference retrieval.


How It Works

Visual Storytelling Pipeline

Before drawing anything, the Planner asks three questions:

Standard vs Storytelling approach comparison

A container format becomes a shipping crate with compartments. A self-learning database becomes a living library. The reviewer "gets it" in seconds.

SVG Pipeline: Vision Critic Loop

The highest-quality mode. LLM writes SVG code directly, Cairo renders to PNG, multimodal Gemini evaluates the rendered image, and spatial fixes are applied automatically.

  1. SVG Generation -- LLM writes SVG with labels and descriptions on every element
  2. Cairo Rendering -- 100% text fidelity (text placed by renderer, not predicted by a neural net)
  3. Vision Critique -- evaluates for overlap, clipping, layout, missing information
  4. Self-Correction -- fixes applied, re-rendered, re-evaluated until 95+/100

Side-by-Side: Before and After

Technical Architecture (+30 points)

With Visual Metaphor (95/100) Standard Pipeline (65/100)
Enhanced Standard

WiFi sensing is invisible. The metaphor -- waves passing through a person with a pose overlay -- makes the invisible visible.

Abstract Concepts (+16 points)

With Visual Metaphor (92/100) Standard Pipeline (76/100)
Enhanced Standard

The "Knowledge City" metaphor turns abstract ideas into something tangible.

More examples

Application Ecosystem (+25)

With Visual Metaphor (93/100) Standard Pipeline (68/100)
Enhanced Standard

Product Overview (+16)

With Visual Metaphor (94/100) Standard Pipeline (78/100)
Enhanced Standard

Quality Results

Quality score comparison chart
Scenario Enhanced Baseline Gain
Application ecosystem 93 68 +25
Product overview 94 78 +16
Technical architecture 95 65 +30
Abstract concepts 92 76 +16
Average 93.5 71.75 +21.75

Architecture

Enhanced pipeline flow diagram

The original 5-agent pipeline (all enhanced in this fork) plus the new SVG Visualizer:

Agent Enhancement
Planner Mandatory visual metaphor discovery before describing elements
Stylist Preserves metaphors (never flattens into generic boxes)
Visualizer Multi-candidate parallel generation + tag stripping
Critic 7 mandatory visual excellence checks, strict pass threshold
SVG Visualizer (new) LLM writes SVG + Cairo render + multimodal vision critic loop
Quality journey across iterations
Quality score progression chart

Each iteration built on the one before. The storytelling step produced the largest single improvement because it changes the strategy rather than just the execution.

PaperBanana Framework

Advanced: Batch Evaluation and Visualization
# Batch evaluation
python main.py \
  --dataset_name "PaperBananaBench" \
  --task_name "diagram" \
  --split_name "test" \
  --exp_mode "dev_full" \
  --retrieval_setting "auto"

# Pipeline evolution viewer
streamlit run visualize/show_pipeline_evolution.py

# Evaluation results
streamlit run visualize/show_referenced_eval.py

Modes: vanilla, dev_planner, dev_planner_stylist, dev_planner_critic, dev_full, demo_planner_critic, demo_full


Project Structure

Project Structure

ASCII Version (for AI/accessibility)
agents/
  planner_agent.py        # Visual metaphor discovery + description
  stylist_agent.py         # Metaphor-preserving style refinement
  visualizer_agent.py      # Multi-candidate image generation
  svg_visualizer_agent.py  # SVG code gen + Cairo render + vision critic
  critic_agent.py          # 7-check visual excellence scoring
  retriever_agent.py       # Reference example retrieval
  vanilla_agent.py         # Direct generation (baseline)
  polish_agent.py          # Post-processing refinement
mcp_server/
  server.py                # MCP server (4 tools) for AI assistants
cli_generate.py            # Headless CLI for single diagram generation
demo.py                    # Streamlit web UI
main.py                    # Batch evaluation runner
configs/                   # Model config templates
data/PaperBananaBench/     # Reference dataset (download separately)
docs/                      # Comparison images, showcase gallery
style_guides/              # NeurIPS aesthetic guidelines
utils/                     # Pipeline orchestration, config
visualize/                 # Pipeline visualization tools

Community Supports

Note: The community projects below are independent implementations of the original PaperBanana paper -- not related to the Skill or MCP in this fork. If you installed from Quick Start above, you're using this fork's SVG pipeline.

Related work in automated academic illustration:

We warmly welcome community contributions to make PaperBanana even better!

License

Apache-2.0

Citation

@article{zhu2026paperbanana,
  title={PaperBanana: Automating Academic Illustration for AI Scientists},
  author={Zhu, Dawei and Meng, Rui and Song, Yale and Wei, Xiyu and Li, Sujian and Pfister, Tomas and Yoon, Jinsung},
  journal={arXiv preprint arXiv:2601.23265},
  year={2026}
}

Disclaimer

This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.

Our goal is simply to benefit the community, so currently we have no plans to use it for commercial purposes. The core methodology was developed during my internship at Google, and patents have been filed for these specific workflows by Google. While this doesn't impact open-source research efforts, it restricts third-party commercial applications using similar logic.

About

PaperBanana: Automating Academic Illustration For AI Scientists

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 51.9%
  • Jupyter Notebook 25.4%
  • JavaScript 17.5%
  • HTML 4.1%
  • Shell 0.6%
  • CSS 0.4%
  • Dockerfile 0.1%