MobileDev-Bench/mobile-swe-agent

mobile-swe-agent

A mobile-platform SWE agent built on top of mini-swe-agent, adapted to work with the mobiledev-bench harness for Android (Kotlin/Java), Flutter (Dart), and React Native (TypeScript/JavaScript) benchmarks.


Project Layout

mobile-swe-agent/
├── minisweagent/
│   ├── agents/
│   ├── environments/
│   ├── models/
│   ├── config/
│   │   ├── mini.yaml
│   │   ├── default.yaml
│   │   └── benchmarks/
│   └── run/
├── mobiledev_bench/
│   ├── harness/
│   └── utils/
├── eval_runner/
│   └── eval/
├── data/
├── docs/
├── tests/
├── results/
├── run_agent_eval.py
├── pyproject.toml
├── .env.example
└── .gitignore
  • minisweagent/: local mini-swe-agent copy and platform benchmark configs.
  • mobiledev_bench/: local MobileDev-Bench harness copy.
  • eval_runner/: glue code for dataset loading, GHCR images, prompts, validation, and patch capture.
  • run_agent_eval.py: Stage 2 CLI for generating patches.jsonl.

Installation

cd mobile-swe-agent
pip install -e ".[dev]"

cp .env.example .env
# Open .env and add your OPENROUTER_API_KEY

Dataset

Place your dataset JSONL under data/. See data/README.md for the full format spec.

The script accepts two input formats and handles either one transparently:

  • Dataset JSONL — produced by build_dataset.py; includes baseline test results (run_result, test_patch_result, fix_patch_result) from prior harness runs. This is the normal input.
  • Raw PullRequest JSONL — just the benchmark instance fields, without result data.
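To make the difference between the two formats concrete, here is a minimal sketch that tells them apart by checking for the three baseline result fields named above. The detection rule and the function names are assumptions for illustration; the script's actual logic may differ.

```python
import json

# Baseline result fields that only Dataset JSONL records carry (named in the list above).
RESULT_FIELDS = ("run_result", "test_patch_result", "fix_patch_result")


def classify_record(record: dict) -> str:
    """Return "dataset" if all baseline result fields are present, else "raw"."""
    return "dataset" if all(field in record for field in RESULT_FIELDS) else "raw"


def classify_jsonl_line(line: str) -> str:
    """Parse one JSONL line and classify the record it contains."""
    return classify_record(json.loads(line))
```

A record that has only some of the result fields is treated as raw here, which is a design choice of this sketch and not documented behaviour.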

With GHCR images (--use_remote) you only need the JSONL file — no local image build step.


Three-Stage Pipeline

Stage 1 — Build Docker images (skip if using GHCR)

Pre-built images for all instances are hosted at github.com/orgs/MobileDev-Bench/packages. Pass --use_remote in Stage 2 to pull them automatically — no local build needed.

To build locally instead:

python -m mobiledev_bench.harness.build_dataset \
    --mode      image \
    --workdir   ./workdir \
    --dataset_files data/dataset.jsonl \
    --repo_dir  ./repos

Stage 2 — Run agent evaluation

python run_agent_eval.py \
    --dataset    data/dataset.jsonl \
    --output     results/<run-id>/patches.jsonl \
    --use_remote \
    --workers    4 \
    --instance_timeout 1800

Each instance runs inside its harness-built Docker image (pulled from GHCR or built locally). The agent edits source files; once it submits, git diff HEAD is captured as the fix patch and written to the output JSONL.
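The patch-capture step can be pictured roughly as follows. This is a sketch under assumptions: the real capture happens inside the Docker container, and the record field names used here (instance_id, fix_patch) are hypothetical and not taken from the actual output schema.

```python
import json
import subprocess
from pathlib import Path


def capture_diff(repo_dir: str) -> str:
    """Return the output of `git diff HEAD` for the working tree in repo_dir."""
    result = subprocess.run(
        ["git", "diff", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git exits with an error
    )
    return result.stdout


def append_patch_record(output_path: str, instance_id: str, patch: str) -> None:
    """Append one JSON object per line; the field names are illustrative only."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps({"instance_id": instance_id, "fix_patch": patch}) + "\n")
```

Note that git diff HEAD only reports changes to tracked files, so a file the agent creates appears in the diff only once it has been added to the index.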

Trajectories (full message history + cost metadata) are saved alongside the patches under results/<run>/trajectories/.

Smoke tests

Use one worker for smoke tests so the logs and trajectories are easy to inspect. These commands exercise one representative instance per supported mobile stack, pull the benchmark images from GHCR automatically, and cap each smoke instance at 30 minutes with --instance_timeout 1800:

# Flutter / Dart
python run_agent_eval.py \
  --dataset data/dataset.jsonl \
  --output results/smoke-flutter-qwen/patches.jsonl \
  --use_remote \
  --workers 1 \
  --instance_timeout 1800 \
  --specifics "PalisadoesFoundation/talawa:pr-2469"

# Android / Kotlin
python run_agent_eval.py \
  --dataset data/dataset.jsonl \
  --output results/smoke-kotlin-qwen/patches.jsonl \
  --use_remote \
  --workers 1 \
  --instance_timeout 1800 \
  --specifics "commons-app/apps-android-commons:pr-6324"

# React Native / TypeScript
python run_agent_eval.py \
  --dataset data/dataset.jsonl \
  --output results/smoke-rn-qwen/patches.jsonl \
  --use_remote \
  --workers 1 \
  --instance_timeout 1800 \
  --specifics "NMF-earth/nmf-app:pr-424"

After each run, inspect the corresponding trajectory JSON under results/<smoke-run>/trajectories/ and the post-run patch flags in results/<smoke-run>/patch_flags.jsonl.

All flags

| Flag | Default | Description |
| --- | --- | --- |
| --dataset | (required) | Path to dataset JSONL |
| --output | (required) | Output Patch JSONL path |
| --model | (from YAML) | Model string; overrides model_name in platform YAML |
| --workers | 1 | Parallel Docker containers |
| --use_remote | off | Pull images from GHCR; delete after each run |
| --ghcr_username | MobileDev-Bench | GHCR org that hosts the images |
| --specifics | (all) | Run only instances whose ID matches (e.g. "PaulWoitaschek/Voice:pr-42") |
| --instance_timeout | 0 (off) | Hard wall-clock limit per instance in seconds (see Timeouts) |
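The instance IDs passed to --specifics in the examples all follow an org/repo:pr-N shape. A small parser for that shape might look like the sketch below; the format is inferred from the examples in this README only, and the InstanceId type and parse_instance_id function are hypothetical helpers, not part of the tool.

```python
import re
from typing import NamedTuple


class InstanceId(NamedTuple):
    org: str
    repo: str
    pr_number: int


# Shape inferred from the README examples, e.g. "NMF-earth/nmf-app:pr-424".
_PATTERN = re.compile(r"^(?P<org>[^/:]+)/(?P<repo>[^/:]+):pr-(?P<number>\d+)$")


def parse_instance_id(value: str) -> InstanceId:
    """Split an org/repo:pr-N string into its parts, or raise ValueError."""
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"not an org/repo:pr-N instance ID: {value!r}")
    return InstanceId(match["org"], match["repo"], int(match["number"]))
```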

Stage 3 — Score patches

Create the evaluator scratch and output directories first; run_evaluation.py expects the --workdir path to already exist.

mkdir -p results/<run-id>/workdir results/<run-id>/eval results/<run-id>/eval-logs

python -m mobiledev_bench.harness.run_evaluation \
    --mode              evaluation \
    --workdir           results/<run-id>/workdir \
    --patch_files       results/<run-id>/patches.jsonl \
    --dataset_files     data/dataset.jsonl \
    --output_dir        results/<run-id>/eval \
    --log_dir           results/<run-id>/eval-logs \
    --use_remote_images true \
    --ghcr_username     MobileDev-Bench \
    --human_mode        true \
    --max_workers       4

The evaluator mounts each agent patch at /home/fix.patch inside the pre-built image and runs bash /home/fix-run.sh, which applies the test patch and the fix patch, executes the test_command, and parses the XML test results.

The two modes differ in what they produce: --mode evaluation runs the containers and also generates results/<run-id>/eval/final_report.json, whereas --mode instance_only only runs the containers and leaves per-instance logs under results/<run-id>/workdir/.../evals/. Setting --human_mode true tells the evaluator that the patch comes from an external patch file, such as this agent's patches.jsonl.
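For readers unfamiliar with the result-parsing step, the sketch below counts outcomes in JUnit-style XML, the format Gradle writes for test reports. Whether every platform in the harness emits exactly this format is an assumption, and summarize_junit is a hypothetical helper, not the harness's own parser.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text: str) -> dict:
    """Count passed, failed, and skipped test cases in JUnit-style XML."""
    root = ET.fromstring(xml_text)
    summary = {"passed": 0, "failed": 0, "skipped": 0}
    # iter() finds <testcase> at any depth, so both <testsuites> and
    # <testsuite> root elements are handled.
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            summary["failed"] += 1
        elif case.find("skipped") is not None:
            summary["skipped"] += 1
        else:
            summary["passed"] += 1
    return summary
```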


Model Selection

Models are resolved in this priority order:

1. --model CLI flag                    (highest priority)
2. model_name in platform YAML         (per-platform default)
3. MSWEA_MODEL_NAME environment var
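The priority order above amounts to "first non-empty value wins". A minimal sketch, assuming a hypothetical resolve_model helper (the real resolution code in the repository may be structured differently):

```python
import os
from typing import Mapping, Optional


def resolve_model(
    cli_model: Optional[str],
    yaml_model: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the model: CLI flag first, then platform YAML, then MSWEA_MODEL_NAME."""
    env = os.environ if env is None else env
    for candidate in (cli_model, yaml_model, env.get("MSWEA_MODEL_NAME")):
        if candidate:  # skip None and empty strings
            return candidate
    raise ValueError("no model configured via --model, YAML, or MSWEA_MODEL_NAME")
```

For example, resolve_model(None, "openrouter/qwen/qwen3-coder", {}) falls through the missing CLI value and returns the YAML default.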

The platform YAML configs default to openrouter/qwen/qwen3-coder. To switch models for a run, pass --model:

# Qwen3-Coder via OpenRouter (default)
python run_agent_eval.py --dataset data/dataset.jsonl --output results/qwen3/patches.jsonl --use_remote

# Claude Sonnet via OpenRouter
python run_agent_eval.py --dataset data/dataset.jsonl --output results/claude/patches.jsonl \
    --model openrouter/anthropic/claude-sonnet-4-5 --use_remote

# Claude directly via Anthropic API (needs ANTHROPIC_API_KEY)
python run_agent_eval.py --dataset data/dataset.jsonl --output results/claude/patches.jsonl \
    --model anthropic/claude-sonnet-4-5 --use_remote

To change the default for all runs, edit model_name at the top of the relevant YAML file(s) in minisweagent/config/benchmarks/.

Model routing

| Model string prefix | Routes to | Cost tracking |
| --- | --- | --- |
| openrouter/… | OpenRouterModel (direct HTTPS to OpenRouter API) | Exact; reads usage.cost from the API response |
| anything else | LitellmModel (via LiteLLM abstraction) | Estimated via LiteLLM pricing table |

Use the openrouter/ prefix for all OpenRouter models. The prefix is stripped before the model name is sent to the OpenRouter API (e.g. openrouter/qwen/qwen3-coder → API receives qwen/qwen3-coder).
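The routing rule can be sketched as a single prefix check. The route_model function below is hypothetical and returns class names as strings so that it stays self-contained; passing non-OpenRouter strings through unchanged is an assumption of this sketch.

```python
from typing import Tuple

OPENROUTER_PREFIX = "openrouter/"


def route_model(model: str) -> Tuple[str, str]:
    """Return (backend class name, model name sent to the API)."""
    if model.startswith(OPENROUTER_PREFIX):
        # The prefix only selects the backend; it is stripped before the API call.
        return "OpenRouterModel", model[len(OPENROUTER_PREFIX):]
    # Everything else goes through LiteLLM (name passed through unchanged here).
    return "LitellmModel", model
```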


Platform Configs

minisweagent/config/benchmarks/ contains one YAML per platform. Each controls the system prompt, instance template, model default, cost/step limits, per-command timeout, and environment variables.

| Platform | Config | Per-command timeout | Cost limit | Step limit | Key env vars |
| --- | --- | --- | --- | --- | --- |
| Android (Kotlin/Java) | android.yaml | 1200 s | $1.00 | 50 | GRADLE_OPTS=-Dorg.gradle.daemon=false |
| Flutter (Dart) | flutter.yaml | 600 s | $1.00 | 50 |  |
| React Native (TS/JS) | react-native.yaml | 600 s | $1.00 | 50 | NODE_OPTIONS, CI=true |

Timeouts explained

There are two independent timeout controls:

  • Per-command timeout (environment.timeout in YAML): the deadline for each individual bash command the agent issues. If a command overruns, only that command is killed, so a single slow Gradle build does not end the instance. Android gets 1200 s to handle cold Gradle builds; Flutter and React Native get 600 s.
  • Per-instance wall-clock timeout (--instance_timeout N CLI flag, default off): a hard end-to-end limit for the whole instance. If it is hit, the partial git diff is still collected and written. The limit is implemented with SIGALRM and is only reliable with --workers 1.
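A SIGALRM-based wall-clock limit generally looks like the sketch below (Unix only). It is an illustration of the technique, not the repository's implementation; wall_clock_limit, run_with_limit, and InstanceTimeout are hypothetical names. Python only allows signal handlers to be installed from the main thread, which is one plausible reason such a limit works best with a single worker.

```python
import signal
import time
from contextlib import contextmanager


class InstanceTimeout(Exception):
    """Raised when the wall-clock budget for one instance is exhausted."""


@contextmanager
def wall_clock_limit(seconds: float):
    """Raise InstanceTimeout in the main thread after `seconds`; 0 disables the limit."""
    if seconds <= 0:
        yield
        return

    def _on_alarm(signum, frame):
        raise InstanceTimeout(f"instance exceeded {seconds} s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)  # main thread only
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def run_with_limit(work, seconds: float):
    """Return (finished, result); on timeout the caller can still collect a partial diff."""
    try:
        with wall_clock_limit(seconds):
            return True, work()
    except InstanceTimeout:
        return False, None
```

Calling run_with_limit(lambda: time.sleep(2), 0.1) returns (False, None) because the alarm interrupts the sleep, while a limit of 0 runs the work with no deadline.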

In normal operation, the limits that actually stop an instance are cost_limit: 1.00 (approximately $1 per instance) and step_limit: 50 (approximately 50 LLM calls); the instance ends at whichever is reached first.
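The "whichever is hit first" rule can be sketched as a check run before each LLM call. The Limits dataclass and stop_reason function are hypothetical; the comparison operators and the convention that 0 disables a limit are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    cost_limit: float = 1.00  # USD per instance (platform YAML default)
    step_limit: int = 50      # LLM calls per instance (platform YAML default)


def stop_reason(cost_so_far: float, steps_so_far: int, limits: Limits = Limits()) -> Optional[str]:
    """Return the name of the limit that was reached, or None to keep going."""
    if limits.cost_limit > 0 and cost_so_far >= limits.cost_limit:
        return "cost_limit"
    if limits.step_limit > 0 and steps_so_far >= limits.step_limit:
        return "step_limit"
    return None
```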


Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| OPENROUTER_API_KEY | Yes (OpenRouter models) | OpenRouter API key |
| ANTHROPIC_API_KEY | Yes (direct Anthropic) | Anthropic API key |
| MSWEA_MODEL_NAME | No | Fallback model if not set in YAML or CLI |
| MSWEA_SILENT_STARTUP | Auto-set | Suppresses mini-swe-agent startup banner (set automatically by run_agent_eval.py) |
| MSWEA_COST_TRACKING | No | Set to ignore_errors to silence cost tracking warnings |
| MSWEA_DOCKER_EXECUTABLE | No | Override docker binary path |

Copy .env.example to .env and fill in your keys. The project .env is loaded automatically before any model or agent code runs.
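Before a long run it can help to check that the key matching your model string is actually set. The helper below is a hypothetical pre-flight check that covers only the two providers this README documents; it is not part of run_agent_eval.py.

```python
import os
from typing import Mapping, Optional


def required_api_key(model: str) -> Optional[str]:
    """Map a model string to the environment variable it needs, if known."""
    if model.startswith("openrouter/"):
        return "OPENROUTER_API_KEY"
    if model.startswith("anthropic/"):
        return "ANTHROPIC_API_KEY"
    return None  # other providers are outside what this README documents


def missing_api_key(model: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the required key if it is unset or empty, else None."""
    env = os.environ if env is None else env
    key = required_api_key(model)
    return key if key and not env.get(key) else None
```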


Extending

Add a new repo/platform: create a file under mobiledev_bench/harness/repos/ that defines the Image subclasses and an Instance subclass decorated with @Instance.register(org, repo). Add a YAML config under minisweagent/config/benchmarks/ if the platform needs different prompts or timeouts.
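The @Instance.register(org, repo) decorator follows the common class-registry pattern. The sketch below uses a stand-in Instance class to show how such a registry typically works; the real base class in mobiledev_bench/harness has its own constructor and methods, and the create and test_command members, the example org/repo, and the command string are all hypothetical.

```python
class Instance:
    """Stand-in base class holding an (org, repo) -> subclass registry."""

    _registry = {}

    @classmethod
    def register(cls, org: str, repo: str):
        """Return a class decorator that records the subclass under (org, repo)."""
        def decorator(subclass):
            cls._registry[(org, repo)] = subclass
            return subclass
        return decorator

    @classmethod
    def create(cls, org: str, repo: str, *args, **kwargs):
        """Instantiate the subclass registered for (org, repo)."""
        try:
            subclass = cls._registry[(org, repo)]
        except KeyError:
            raise KeyError(f"no Instance registered for {org}/{repo}") from None
        return subclass(*args, **kwargs)


@Instance.register("example-org", "example-app")  # hypothetical org/repo
class ExampleApp(Instance):
    def test_command(self) -> str:
        return "./gradlew test"  # illustrative command only
```

With this pattern, simply importing the new file is enough to make the repo known to the lookup, which is why each repo lives in its own module.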

Use a different model: either pass --model <string> at the CLI, or edit model_name in the relevant platform YAML.

Customise agent behaviour: subclass DefaultAgent in minisweagent/agents/ and instantiate your subclass in run_agent_eval.py in place of DefaultAgent.
