A mobile-platform SWE agent built on top of mini-swe-agent, adapted to work with the mobiledev-bench harness for Android (Kotlin/Java), Flutter (Dart), and React Native (TypeScript/JavaScript) benchmarks.
mobile-swe-agent/
├── minisweagent/
│ ├── agents/
│ ├── environments/
│ ├── models/
│ ├── config/
│ │ ├── mini.yaml
│ │ ├── default.yaml
│ │ └── benchmarks/
│ └── run/
├── mobiledev_bench/
│ ├── harness/
│ └── utils/
├── eval_runner/
│ └── eval/
├── data/
├── docs/
├── tests/
├── results/
├── run_agent_eval.py
├── pyproject.toml
├── .env.example
└── .gitignore
- minisweagent/: local mini-swe-agent copy and platform benchmark configs.
- mobiledev_bench/: local MobileDev-Bench harness copy.
- eval_runner/: glue code for dataset loading, GHCR images, prompts, validation, and patch capture.
- run_agent_eval.py: Stage 2 CLI for generating patches.jsonl.
cd mobile-swe-agent
pip install -e ".[dev]"
cp .env.example .env
# Open .env and add your OPENROUTER_API_KEY
Place your dataset JSONL under data/. See data/README.md for the full format spec.
The script accepts both formats transparently:
- Dataset JSONL: produced by build_dataset.py; includes baseline test results (run_result, test_patch_result, fix_patch_result) from prior harness runs. This is the normal input.
- Raw PullRequest JSONL: just the benchmark instance fields, without result data.
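The two input formats can be told apart by the presence of the baseline result fields. The sketch below illustrates that idea only; it is an assumption about how detection could work, not the script's actual logic. `classify_record` is a hypothetical helper, and the `org`, `repo`, and `number` keys in the sample records are placeholders for the real instance fields.

```python
import json

# Baseline result fields that, per the list above, appear only in
# Dataset JSONL records.
RESULT_FIELDS = ("run_result", "test_patch_result", "fix_patch_result")


def classify_record(line: str) -> str:
    """Return "dataset" if the JSONL line carries baseline results, else "raw"."""
    record = json.loads(line)
    has_results = all(field in record for field in RESULT_FIELDS)
    return "dataset" if has_results else "raw"


# Hypothetical sample records; the non-result keys are illustrative only.
raw_line = json.dumps({"org": "example-org", "repo": "example-app", "number": 1})
dataset_line = json.dumps(
    {
        "org": "example-org",
        "repo": "example-app",
        "number": 1,
        "run_result": {},
        "test_patch_result": {},
        "fix_patch_result": {},
    }
)

print(classify_record(raw_line))      # raw
print(classify_record(dataset_line))  # dataset
```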
With GHCR images (--use_remote) you only need the JSONL file — no local image build step.
Pre-built images for all instances are hosted at github.com/orgs/MobileDev-Bench/packages.
Pass --use_remote in Stage 2 to pull them automatically; no local build is needed.
To build locally instead:
python -m mobiledev_bench.harness.build_dataset \
--mode image \
--workdir ./workdir \
--dataset_files data/dataset.jsonl \
--repo_dir ./repos
python run_agent_eval.py \
--dataset data/dataset.jsonl \
--output results/<run-id>/patches.jsonl \
--use_remote \
--workers 4 \
--instance_timeout 1800
Each instance runs inside its harness-built Docker image (pulled from GHCR or built locally). The agent edits source files; once it submits, git diff HEAD is captured as the fix patch and written to the output JSONL.
Trajectories (full message history + cost metadata) are saved alongside the patches under results/<run>/trajectories/.
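The capture step described above can be sketched as follows. This is a simplified illustration, not the runner's code: the real runner takes the diff inside the container, whereas this sketch runs git on a local checkout. The helper names and the `instance_id` and `fix_patch` keys are assumptions; check the actual output file for the real schema.

```python
import json
import subprocess
from pathlib import Path


def capture_fix_patch(repo_dir: str) -> str:
    """Return the output of `git diff HEAD` for the working tree at repo_dir."""
    result = subprocess.run(
        ["git", "diff", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git exits with a non-zero status
    )
    return result.stdout


def append_patch_record(output_path: str, instance_id: str, fix_patch: str) -> None:
    """Append one JSON object per line; the key names are hypothetical."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"instance_id": instance_id, "fix_patch": fix_patch}
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
```

Appending one line per instance keeps the file valid JSONL even if a run is interrupted partway through.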
Use one worker for smoke tests so the logs and trajectories are easy to inspect. These commands exercise one representative instance per supported mobile stack, pull the benchmark images from GHCR automatically, and cap each smoke instance at 30 minutes with --instance_timeout 1800:
# Flutter / Dart
python run_agent_eval.py \
--dataset data/dataset.jsonl \
--output results/smoke-flutter-qwen/patches.jsonl \
--use_remote \
--workers 1 \
--instance_timeout 1800 \
--specifics "PalisadoesFoundation/talawa:pr-2469"
# Android / Kotlin
python run_agent_eval.py \
--dataset data/dataset.jsonl \
--output results/smoke-kotlin-qwen/patches.jsonl \
--use_remote \
--workers 1 \
--instance_timeout 1800 \
--specifics "commons-app/apps-android-commons:pr-6324"
# React Native / TypeScript
python run_agent_eval.py \
--dataset data/dataset.jsonl \
--output results/smoke-rn-qwen/patches.jsonl \
--use_remote \
--workers 1 \
--instance_timeout 1800 \
--specifics "NMF-earth/nmf-app:pr-424"
After each run, inspect the corresponding trajectory JSON under results/<smoke-run>/trajectories/ and the post-run patch flags in results/<smoke-run>/patch_flags.jsonl.
| Flag | Default | Description |
|---|---|---|
| --dataset | (required) | Path to dataset JSONL |
| --output | (required) | Output Patch JSONL path |
| --model | (from YAML) | Model string; overrides model_name in platform YAML |
| --workers | 1 | Parallel Docker containers |
| --use_remote | off | Pull images from GHCR; delete after each run |
| --ghcr_username | MobileDev-Bench | GHCR org that hosts the images |
| --specifics | (all) | Run only instances whose ID matches (e.g. "PaulWoitaschek/Voice:pr-42") |
| --instance_timeout | 0 (off) | Hard wall-clock limit per instance in seconds (see Timeouts) |
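The instance IDs passed to --specifics in the examples all follow an org/repo:pr-N shape. The sketch below parses that shape; the format is inferred from the examples only, `parse_instance_id` is a hypothetical helper, and whether the CLI matches IDs exactly or partially is not specified here.

```python
import re
from typing import NamedTuple


class InstanceId(NamedTuple):
    org: str
    repo: str
    number: int


# Shape inferred from the examples in this README: "<org>/<repo>:pr-<number>".
_ID_PATTERN = re.compile(r"^(?P<org>[^/\s]+)/(?P<repo>[^:\s]+):pr-(?P<number>\d+)$")


def parse_instance_id(value: str) -> InstanceId:
    """Split an instance ID into its org, repo, and pull request number."""
    match = _ID_PATTERN.match(value)
    if match is None:
        raise ValueError(f"not an org/repo:pr-N instance ID: {value!r}")
    return InstanceId(match["org"], match["repo"], int(match["number"]))


print(parse_instance_id("PaulWoitaschek/Voice:pr-42"))
# InstanceId(org='PaulWoitaschek', repo='Voice', number=42)
```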
Create the evaluator scratch and output directories first; run_evaluation.py expects the --workdir path to already exist.
mkdir -p results/<run-id>/workdir results/<run-id>/eval results/<run-id>/eval-logs
python -m mobiledev_bench.harness.run_evaluation \
--mode evaluation \
--workdir results/<run-id>/workdir \
--patch_files results/<run-id>/patches.jsonl \
--dataset_files data/dataset.jsonl \
--output_dir results/<run-id>/eval \
--log_dir results/<run-id>/eval-logs \
--use_remote_images true \
--ghcr_username MobileDev-Bench \
--human_mode true \
--max_workers 4
The evaluator mounts each agent patch at /home/fix.patch inside the pre-built image and runs bash /home/fix-run.sh, which applies the test patch + fix patch, executes the test_command, and parses the XML test results. --mode evaluation also generates results/<run-id>/eval/final_report.json; --mode instance_only only runs the containers and leaves per-instance logs under results/<run-id>/workdir/.../evals/. --human_mode true means the patch comes from an external patch file, such as this agent's patches.jsonl.
Models are resolved in this priority order:
1. --model CLI flag (highest priority)
2. model_name in platform YAML (per-platform default)
3. MSWEA_MODEL_NAME environment variable (lowest priority)
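The priority order above amounts to taking the first configured value. A minimal sketch, assuming empty strings count as "not set"; `resolve_model_name` is a hypothetical name and not necessarily how run_agent_eval.py implements it:

```python
import os
from typing import Mapping, Optional


def resolve_model_name(
    cli_model: Optional[str],
    yaml_model: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the model: --model first, then YAML model_name, then MSWEA_MODEL_NAME."""
    for candidate in (cli_model, yaml_model, env.get("MSWEA_MODEL_NAME")):
        if candidate:  # skips None and empty strings
            return candidate
    raise ValueError("no model configured via --model, YAML, or MSWEA_MODEL_NAME")


# The CLI flag wins even when the other two sources are set.
print(
    resolve_model_name(
        "anthropic/claude-sonnet-4-5",
        "openrouter/qwen/qwen3-coder",
        {"MSWEA_MODEL_NAME": "openrouter/qwen/qwen3-coder"},
    )
)
# anthropic/claude-sonnet-4-5
```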
The platform YAML configs default to openrouter/qwen/qwen3-coder. To switch models for a run, pass --model:
# Qwen3-Coder via OpenRouter (default)
python run_agent_eval.py --dataset data/dataset.jsonl --output results/qwen3/patches.jsonl --use_remote
# Claude Sonnet via OpenRouter
python run_agent_eval.py --dataset data/dataset.jsonl --output results/claude/patches.jsonl \
--model openrouter/anthropic/claude-sonnet-4-5 --use_remote
# Claude directly via Anthropic API (needs ANTHROPIC_API_KEY)
python run_agent_eval.py --dataset data/dataset.jsonl --output results/claude/patches.jsonl \
--model anthropic/claude-sonnet-4-5 --use_remote
To change the default for all runs, edit model_name at the top of the relevant YAML file(s) in minisweagent/config/benchmarks/.
| Model string prefix | Routes to | Cost tracking |
|---|---|---|
| openrouter/… | OpenRouterModel (direct HTTPS to OpenRouter API) | Exact; reads usage.cost from API response |
| anything else | LitellmModel (via LiteLLM abstraction) | Estimated via LiteLLM pricing table |
Use the openrouter/ prefix for all OpenRouter models. The prefix is stripped before the model name is sent to the OpenRouter API (e.g. openrouter/qwen/qwen3-coder → API receives qwen/qwen3-coder).
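The routing and prefix stripping described above can be sketched as a small function. It returns class names as strings rather than importing the real classes, and it assumes non-OpenRouter model strings are passed to LiteLLM unchanged, which this README does not state explicitly. `route_model` is a hypothetical helper.

```python
from typing import Tuple

OPENROUTER_PREFIX = "openrouter/"


def route_model(model_string: str) -> Tuple[str, str]:
    """Return (backend class name, model name handed to that backend)."""
    if model_string.startswith(OPENROUTER_PREFIX):
        # Strip the routing prefix before the name is sent to the OpenRouter API.
        return "OpenRouterModel", model_string[len(OPENROUTER_PREFIX):]
    # Assumption: everything else goes to LiteLLM as-is.
    return "LitellmModel", model_string


print(route_model("openrouter/qwen/qwen3-coder"))
# ('OpenRouterModel', 'qwen/qwen3-coder')
print(route_model("anthropic/claude-sonnet-4-5"))
# ('LitellmModel', 'anthropic/claude-sonnet-4-5')
```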
minisweagent/config/benchmarks/ contains one YAML per platform. Each controls the system prompt, instance template, model default, cost/step limits, per-command timeout, and environment variables.
| Platform | Config | Per-command timeout | Cost limit | Step limit | Key env vars |
|---|---|---|---|---|---|
| Android (Kotlin/Java) | android.yaml | 1200 s | $1.00 | 50 | GRADLE_OPTS=-Dorg.gradle.daemon=false |
| Flutter (Dart) | flutter.yaml | 600 s | $1.00 | 50 | (none) |
| React Native (TS/JS) | react-native.yaml | 600 s | $1.00 | 50 | NODE_OPTIONS, CI=true |
There are two independent timeout controls:
- Per-command timeout (environment.timeout in YAML): the deadline for each individual bash command the agent issues. A single slow Gradle build won't kill the instance, only that one command if it overruns. Android gets 1200 s to handle cold Gradle builds; Flutter/RN get 600 s.
- Per-instance wall-clock timeout (--instance_timeout N CLI flag, default off): a hard end-to-end limit for the whole instance. If hit, the partial git diff is still collected and written. Only reliable with --workers 1 (uses SIGALRM).
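To see why a SIGALRM-based limit fits a single worker best, consider the generic sketch below. It is not the runner's implementation; `instance_deadline` and `run_with_deadline` are hypothetical names. In Python, signal handlers can only be installed from the main thread, and a process has a single real-time interval timer, so concurrent instances in one process cannot each hold their own alarm. The sketch is Unix-only.

```python
import signal
import time
from contextlib import contextmanager
from typing import Callable, Iterator


class InstanceTimeout(Exception):
    """Raised when the wall-clock budget for an instance is exhausted."""


@contextmanager
def instance_deadline(seconds: float) -> Iterator[None]:
    """Raise InstanceTimeout in the main thread after `seconds` of wall-clock time."""

    def _on_alarm(signum, frame):
        raise InstanceTimeout(f"instance exceeded {seconds} s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)  # delivers SIGALRM on expiry
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def run_with_deadline(work: Callable[[], None], seconds: float) -> str:
    try:
        with instance_deadline(seconds):
            work()
        return "finished"
    except InstanceTimeout:
        # A runner would still collect the partial diff at this point.
        return "timed out"


print(run_with_deadline(lambda: time.sleep(0.01), 1.0))  # finished
print(run_with_deadline(lambda: time.sleep(1.0), 0.05))  # timed out
```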
The primary brakes in normal operation are cost_limit: 1.00 (approximately $1 per instance) and step_limit: 50 (approximately 50 LLM calls), whichever is hit first.
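The "whichever is hit first" rule is a simple disjunction over the two limits. A minimal sketch using the default values from the platform configs; the `Budget` class is hypothetical, and whether the agent compares with >= or > at the boundary is an assumption:

```python
from dataclasses import dataclass


@dataclass
class Budget:
    cost_limit: float = 1.00  # dollars, mirrors cost_limit in the platform YAML
    step_limit: int = 50      # LLM calls, mirrors step_limit in the platform YAML

    def exhausted(self, cost_so_far: float, steps_so_far: int) -> bool:
        """True as soon as either limit is reached."""
        return cost_so_far >= self.cost_limit or steps_so_far >= self.step_limit


budget = Budget()
print(budget.exhausted(cost_so_far=0.40, steps_so_far=12))  # False
print(budget.exhausted(cost_so_far=0.40, steps_so_far=50))  # True
print(budget.exhausted(cost_so_far=1.05, steps_so_far=12))  # True
```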
| Variable | Required | Description |
|---|---|---|
| OPENROUTER_API_KEY | Yes (OpenRouter models) | OpenRouter API key |
| ANTHROPIC_API_KEY | Yes (direct Anthropic) | Anthropic API key |
| MSWEA_MODEL_NAME | No | Fallback model if not set in YAML or CLI |
| MSWEA_SILENT_STARTUP | Auto-set | Suppresses mini-swe-agent startup banner (set automatically by run_agent_eval.py) |
| MSWEA_COST_TRACKING | No | Set to ignore_errors to silence cost tracking warnings |
| MSWEA_DOCKER_EXECUTABLE | No | Override docker binary path |
Copy .env.example to .env and fill in your keys. The project .env is loaded automatically before any model or agent code runs.
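Before a long run it can help to confirm that the key matching the chosen model is actually set. The check below is a hypothetical pre-flight helper, not part of the project; it covers only the two providers listed in the table and has no opinion on other model prefixes.

```python
import os
from typing import Mapping, Optional

# Provider prefix -> required environment variable, per the table above.
KEY_BY_PREFIX = {
    "openrouter/": "OPENROUTER_API_KEY",
    "anthropic/": "ANTHROPIC_API_KEY",
}


def missing_api_key(
    model_string: str, env: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Return the name of the unset variable the model needs, or None."""
    for prefix, variable in KEY_BY_PREFIX.items():
        if model_string.startswith(prefix):
            return None if env.get(variable) else variable
    return None  # unknown provider: nothing to check in this sketch


print(missing_api_key("openrouter/qwen/qwen3-coder", {}))
# OPENROUTER_API_KEY
print(missing_api_key("openrouter/qwen/qwen3-coder", {"OPENROUTER_API_KEY": "placeholder"}))
# None
```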
Add a new repo/platform: create a file under mobiledev_bench/harness/repos/, then define Image subclasses and an Instance subclass decorated with @Instance.register(org, repo). Add a YAML config under minisweagent/config/benchmarks/ if the platform needs different prompts or timeouts.
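The @Instance.register(org, repo) decorator suggests a class registry keyed by organisation and repository. The standalone sketch below illustrates that general pattern only. The `Instance` class here is a stand-in written for this example, not the harness's class, and `create`, `_registry`, and `ExampleAppInstance` are hypothetical names; consult an existing file under mobiledev_bench/harness/repos/ for the real base classes and required methods.

```python
from typing import Callable, Dict, Tuple, Type


class Instance:
    """Stand-in base class; the real harness class will define more than this."""

    _registry: Dict[Tuple[str, str], Type["Instance"]] = {}

    @classmethod
    def register(
        cls, org: str, repo: str
    ) -> Callable[[Type["Instance"]], Type["Instance"]]:
        """Class decorator that files the subclass under (org, repo)."""

        def decorator(subclass: Type["Instance"]) -> Type["Instance"]:
            cls._registry[(org, repo)] = subclass
            return subclass

        return decorator

    @classmethod
    def create(cls, org: str, repo: str) -> "Instance":
        """Instantiate the handler registered for this repository."""
        try:
            handler = cls._registry[(org, repo)]
        except KeyError:
            raise LookupError(f"no Instance registered for {org}/{repo}") from None
        return handler()


@Instance.register("example-org", "example-app")
class ExampleAppInstance(Instance):
    """Hypothetical handler for a made-up repository."""


print(type(Instance.create("example-org", "example-app")).__name__)
# ExampleAppInstance
```

Registering at import time means a new repo file only needs to be imported for its handler to become available, with no central list to edit.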
Use a different model: either pass --model <string> at the CLI, or edit model_name in the relevant platform YAML.
Customise agent behaviour: subclass DefaultAgent in minisweagent/agents/ and instantiate it in run_agent_eval.py instead of DefaultAgent.