mobile-swe-agent

A mobile-platform SWE agent built on top of mini-swe-agent, adapted to work with the mobiledev-bench harness for Android (Kotlin/Java), Flutter (Dart), and React Native (TypeScript/JavaScript) benchmarks.

Project Layout

mobile-swe-agent/
├── minisweagent/
│   ├── agents/
│   ├── environments/
│   ├── models/
│   ├── config/
│   │   ├── mini.yaml
│   │   ├── default.yaml
│   │   └── benchmarks/
│   └── run/
├── mobiledev_bench/
│   ├── harness/
│   └── utils/
├── eval_runner/
│   └── eval/
├── data/
├── docs/
├── tests/
├── results/
├── run_agent_eval.py
├── pyproject.toml
├── .env.example
└── .gitignore

minisweagent/: local mini-swe-agent copy and platform benchmark configs.
mobiledev_bench/: local MobileDev-Bench harness copy.
eval_runner/: glue code for dataset loading, GHCR images, prompts, validation, and patch capture.
run_agent_eval.py: Stage 2 CLI for generating patches.jsonl.

Installation

cd mobile-swe-agent
pip install -e ".[dev]"

cp .env.example .env
# Open .env and add your OPENROUTER_API_KEY

Dataset

Place your dataset JSONL under data/. See data/README.md for the full format spec.

The script accepts either of the following two formats transparently:

Dataset JSONL — produced by build_dataset.py; includes baseline test results (run_result, test_patch_result, fix_patch_result) from prior harness runs. This is the normal input.
Raw PullRequest JSONL — just the benchmark instance fields, without result data.

With GHCR images (--use_remote) you only need the JSONL file — no local image build step.
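
The distinction between the two formats above can be sketched in a few lines of Python. This is an illustration only, not the loader in eval_runner/: the helper names `record_format` and `iter_records` are hypothetical, and treating a record as a Dataset record only when all three result fields are present is an assumption.

```python
import json
from typing import Iterable, Iterator, Tuple

# Result fields named in this README; their presence marks a Dataset JSONL record.
RESULT_FIELDS = ("run_result", "test_patch_result", "fix_patch_result")


def record_format(record: dict) -> str:
    """Return "dataset" if all baseline result fields are present, else "raw"."""
    return "dataset" if all(field in record for field in RESULT_FIELDS) else "raw"


def iter_records(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (format, record) for each non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield record_format(record), record
```

Passing an open file handle (for example `open("data/dataset.jsonl")`) to `iter_records` is enough to check which format a file uses before starting a long run.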

Three-Stage Pipeline

Stage 1 — Build Docker images (skip if using GHCR)

Pre-built images for all instances are hosted at github.com/orgs/MobileDev-Bench/packages. Pass --use_remote in Stage 2 to pull them automatically — no local build needed.

To build locally instead:

python -m mobiledev_bench.harness.build_dataset \
    --mode      image \
    --workdir   ./workdir \
    --dataset_files data/dataset.jsonl \
    --repo_dir  ./repos

Stage 2 — Run agent evaluation

python run_agent_eval.py \
    --dataset    data/dataset.jsonl \
    --output     results/<run-id>/patches.jsonl \
    --use_remote \
    --workers    4 \
    --instance_timeout 1800

Each instance runs inside its harness-built Docker image (pulled from GHCR or built locally). The agent edits source files; once it submits, git diff HEAD is captured as the fix patch and written to the output JSONL.

Trajectories (full message history + cost metadata) are saved alongside the patches under results/<run>/trajectories/.

Smoke tests

Use one worker for smoke tests so the logs and trajectories are easy to inspect. These commands exercise one representative instance per supported mobile stack, pull the benchmark images from GHCR automatically, and cap each smoke instance at 30 minutes with --instance_timeout 1800:

# Flutter / Dart
python run_agent_eval.py \
  --dataset data/dataset.jsonl \
  --output results/smoke-flutter-qwen/patches.jsonl \
  --use_remote \
  --workers 1 \
  --instance_timeout 1800 \
  --specifics "PalisadoesFoundation/talawa:pr-2469"

# Android / Kotlin
python run_agent_eval.py \
  --dataset data/dataset.jsonl \
  --output results/smoke-kotlin-qwen/patches.jsonl \
  --use_remote \
  --workers 1 \
  --instance_timeout 1800 \
  --specifics "commons-app/apps-android-commons:pr-6324"

# React Native / TypeScript
python run_agent_eval.py \
  --dataset data/dataset.jsonl \
  --output results/smoke-rn-qwen/patches.jsonl \
  --use_remote \
  --workers 1 \
  --instance_timeout 1800 \
  --specifics "NMF-earth/nmf-app:pr-424"

After each run, inspect the corresponding trajectory JSON under results/<smoke-run>/trajectories/ and the post-run patch flags in results/<smoke-run>/patch_flags.jsonl.
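
A small Python helper can collect those artifacts for inspection. This is a sketch under assumptions: `run_artifacts` is a hypothetical name, trajectory files are assumed to end in `.json`, and nothing is assumed about the fields inside `patch_flags.jsonl` beyond it being one JSON object per line.

```python
import json
from pathlib import Path
from typing import Dict, List


def run_artifacts(output_path: str) -> Dict[str, object]:
    """Locate trajectories and patch flags next to a patches.jsonl path."""
    run_dir = Path(output_path).parent
    # Assumption: trajectory files use a .json suffix.
    trajectories = sorted((run_dir / "trajectories").glob("*.json"))
    flags_file = run_dir / "patch_flags.jsonl"
    flags: List[dict] = []
    if flags_file.exists():
        with flags_file.open() as handle:
            flags = [json.loads(line) for line in handle if line.strip()]
    return {"trajectories": trajectories, "patch_flags": flags}
```

For example, `run_artifacts("results/smoke-flutter-qwen/patches.jsonl")` returns the trajectory paths and the parsed flag records for the Flutter smoke run.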

All flags

| Flag | Default | Description |
| --- | --- | --- |
| `--dataset` | (required) | Path to dataset JSONL |
| `--output` | (required) | Output patch JSONL path |
| `--model` | (from YAML) | Model string; overrides `model_name` in platform YAML |
| `--workers` | `1` | Parallel Docker containers |
| `--use_remote` | off | Pull images from GHCR; delete after each run |
| `--ghcr_username` | `MobileDev-Bench` | GHCR org that hosts the images |
| `--specifics` | (all) | Run only instances whose ID matches (e.g. `"PaulWoitaschek/Voice:pr-42"`) |
| `--instance_timeout` | `0` (off) | Hard wall-clock limit per instance in seconds (see Timeouts explained) |
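
For readers who want to script around the CLI, the flags and defaults listed above can be mirrored with argparse. This is a hypothetical reconstruction from the table, not the parser in run_agent_eval.py: the argument types, and `--specifics` taking a single string, are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented Stage 2 flags and their defaults."""
    parser = argparse.ArgumentParser(prog="run_agent_eval.py")
    parser.add_argument("--dataset", required=True, help="Path to dataset JSONL")
    parser.add_argument("--output", required=True, help="Output patch JSONL path")
    parser.add_argument("--model", default=None, help="Overrides model_name in platform YAML")
    parser.add_argument("--workers", type=int, default=1, help="Parallel Docker containers")
    parser.add_argument("--use_remote", action="store_true", help="Pull images from GHCR")
    parser.add_argument("--ghcr_username", default="MobileDev-Bench", help="GHCR org")
    parser.add_argument("--specifics", default=None, help="Run only matching instance IDs")
    parser.add_argument("--instance_timeout", type=int, default=0,
                        help="Per-instance wall-clock limit in seconds (0 = off)")
    return parser
```

Calling `build_parser().parse_args([...])` with only `--dataset` and `--output` yields the defaults shown in the table.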

Stage 3 — Score patches

Create the evaluator scratch and output directories first; run_evaluation.py expects the --workdir path to already exist.

mkdir -p results/<run-id>/workdir results/<run-id>/eval results/<run-id>/eval-logs

python -m mobiledev_bench.harness.run_evaluation \
    --mode              evaluation \
    --workdir           results/<run-id>/workdir \
    --patch_files       results/<run-id>/patches.jsonl \
    --dataset_files     data/dataset.jsonl \
    --output_dir        results/<run-id>/eval \
    --log_dir           results/<run-id>/eval-logs \
    --use_remote_images true \
    --ghcr_username     MobileDev-Bench \
    --human_mode        true \
    --max_workers       4

The evaluator mounts each agent patch at /home/fix.patch inside the pre-built image and runs bash /home/fix-run.sh, which applies the test patch and the fix patch, executes the test_command, and parses the XML test results. With --mode evaluation, the run also generates results/<run-id>/eval/final_report.json; with --mode instance_only, it only runs the containers and leaves per-instance logs under results/<run-id>/workdir/.../evals/. Setting --human_mode true tells the evaluator that the patch comes from an external patch file, such as this agent's patches.jsonl.
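
The directory setup and the evaluator invocation above can be combined in a short Python wrapper. This is a minimal sketch: `prepare_evaluation` is a hypothetical helper, and it assumes the patches live at `<run_dir>/patches.jsonl` as in the Stage 2 example. The flags are copied from the command shown in this section.

```python
import sys
from pathlib import Path
from typing import List


def prepare_evaluation(run_dir: str, dataset: str, max_workers: int = 4) -> List[str]:
    """Create the Stage 3 directories and return the evaluator command line."""
    root = Path(run_dir)
    workdir, out_dir, log_dir = root / "workdir", root / "eval", root / "eval-logs"
    for path in (workdir, out_dir, log_dir):
        path.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
    return [
        sys.executable, "-m", "mobiledev_bench.harness.run_evaluation",
        "--mode", "evaluation",
        "--workdir", str(workdir),
        "--patch_files", str(root / "patches.jsonl"),
        "--dataset_files", dataset,
        "--output_dir", str(out_dir),
        "--log_dir", str(log_dir),
        "--use_remote_images", "true",
        "--ghcr_username", "MobileDev-Bench",
        "--human_mode", "true",
        "--max_workers", str(max_workers),
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`; building the directories first avoids the failure mode where the `--workdir` path does not exist yet.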

Model Selection

Models are resolved in this priority order:

1. --model CLI flag                    (highest priority)
2. model_name in platform YAML         (per-platform default)
3. MSWEA_MODEL_NAME environment var

The platform YAML configs default to openrouter/qwen/qwen3-coder. To switch models for a run, pass --model:

# Qwen3-Coder via OpenRouter (default)
python run_agent_eval.py --dataset data/dataset.jsonl --output results/qwen3/patches.jsonl --use_remote

# Claude Sonnet via OpenRouter
python run_agent_eval.py --dataset data/dataset.jsonl --output results/claude/patches.jsonl \
    --model openrouter/anthropic/claude-sonnet-4-5 --use_remote

# Claude directly via Anthropic API (needs ANTHROPIC_API_KEY)
python run_agent_eval.py --dataset data/dataset.jsonl --output results/claude/patches.jsonl \
    --model anthropic/claude-sonnet-4-5 --use_remote

To change the default for all runs, edit model_name at the top of the relevant YAML file(s) in minisweagent/config/benchmarks/.
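
The priority order described in this section amounts to taking the first non-empty value. A minimal sketch, assuming a hypothetical helper named `resolve_model` (the real lookup in run_agent_eval.py may be structured differently):

```python
import os
from typing import Mapping, Optional


def resolve_model(
    cli_model: Optional[str],
    yaml_model: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the model: --model flag, then YAML model_name, then MSWEA_MODEL_NAME."""
    env = os.environ if env is None else env
    for candidate in (cli_model, yaml_model, env.get("MSWEA_MODEL_NAME")):
        if candidate:
            return candidate
    # Assumption: having no model configured anywhere is treated as an error.
    raise ValueError("no model configured: pass --model, set model_name, or set MSWEA_MODEL_NAME")
```

With the shipped platform YAMLs, `resolve_model(None, "openrouter/qwen/qwen3-coder")` returns the YAML default, and any `--model` value takes precedence over it.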

Model routing

| Model string prefix | Routes to | Cost tracking |
| --- | --- | --- |
| `openrouter/…` | `OpenRouterModel` (direct HTTPS to OpenRouter API) | Exact; reads `usage.cost` from API response |
| anything else | `LitellmModel` (via LiteLLM abstraction) | Estimated via LiteLLM pricing table |

Use the openrouter/ prefix for all OpenRouter models. The prefix is stripped before the model name is sent to the OpenRouter API (e.g. openrouter/qwen/qwen3-coder → API receives qwen/qwen3-coder).
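
The routing rule can be illustrated as a pure function. This is a sketch only: `route_model` is a hypothetical name, and the class names are returned as strings rather than imported, so the example does not depend on the package being installed.

```python
from typing import Tuple

OPENROUTER_PREFIX = "openrouter/"


def route_model(model: str) -> Tuple[str, str]:
    """Return (model class name, model name sent to the API)."""
    if model.startswith(OPENROUTER_PREFIX):
        # The prefix is stripped before the name reaches the OpenRouter API.
        return "OpenRouterModel", model[len(OPENROUTER_PREFIX):]
    # Everything else goes through LiteLLM unchanged.
    return "LitellmModel", model
```

For instance, `route_model("openrouter/qwen/qwen3-coder")` selects `OpenRouterModel` with the name `qwen/qwen3-coder`, while `anthropic/claude-sonnet-4-5` is passed to `LitellmModel` as is.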

Platform Configs

minisweagent/config/benchmarks/ contains one YAML per platform. Each controls the system prompt, instance template, model default, cost/step limits, per-command timeout, and environment variables.

| Platform | Config | Per-command timeout | Cost limit | Step limit | Key env vars |
| --- | --- | --- | --- | --- | --- |
| Android (Kotlin/Java) | `android.yaml` | 1200 s | $1.00 | 50 | `GRADLE_OPTS=-Dorg.gradle.daemon=false` |
| Flutter (Dart) | `flutter.yaml` | 600 s | $1.00 | 50 | none |
| React Native (TS/JS) | `react-native.yaml` | 600 s | $1.00 | 50 | `NODE_OPTIONS`, `CI=true` |

Timeouts explained

There are two independent timeout controls:

Per-command timeout (environment.timeout in YAML): the deadline for each individual bash command the agent issues. A slow Gradle build that overruns kills only that one command, not the whole instance. Android gets 1200 s to handle cold Gradle builds; Flutter and React Native get 600 s.
Per-instance wall-clock timeout (--instance_timeout N CLI flag, default off): a hard end-to-end limit for the whole instance. If it is hit, the partial git diff is still collected and written. Because it is implemented with SIGALRM, it is only reliable with --workers 1.
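
The difference between the two controls can be sketched with the standard library. This is an illustration under assumptions, not the project's code: `run_command`, `instance_deadline`, and `InstanceTimeout` are hypothetical names, and the sketch is Unix-only because it relies on SIGALRM, whose Python handler runs in the main thread.

```python
import signal
import subprocess
from contextlib import contextmanager
from typing import List, Optional, Tuple


class InstanceTimeout(Exception):
    """Raised when the whole-instance deadline expires."""


@contextmanager
def instance_deadline(seconds: float):
    """Raise InstanceTimeout in the main thread after `seconds` (0 disables)."""
    if seconds <= 0:
        yield
        return

    def _on_alarm(signum, frame):
        raise InstanceTimeout(f"instance exceeded {seconds} s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def run_command(cmd: List[str], timeout: float) -> Tuple[Optional[int], str]:
    """Run one command; a timeout ends only this command, not the instance."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return done.returncode, done.stdout
    except subprocess.TimeoutExpired:
        return None, ""  # the caller can report the overrun and carry on
```

Wrapping the agent loop in `with instance_deadline(1800):` and catching `InstanceTimeout` outside it leaves room to collect the partial diff before moving on, which matches the behaviour described above.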

The primary brakes in normal operation are cost_limit: 1.00 (approximately $1 per instance) and step_limit: 50 (approximately 50 LLM calls), whichever is hit first.
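
The "whichever is hit first" rule is a simple disjunction. A minimal sketch, assuming a hypothetical `Budget` class; whether the real check uses `>=` or `>` at the boundary is an assumption here.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Per-instance limits, using the defaults from the platform configs."""

    cost_limit: float = 1.00  # US dollars
    step_limit: int = 50      # LLM calls

    def exhausted(self, cost_so_far: float, steps_so_far: int) -> bool:
        """True as soon as either limit is reached."""
        return cost_so_far >= self.cost_limit or steps_so_far >= self.step_limit
```

An instance that has spent $0.40 after 50 steps stops on the step limit, and one that has spent $1.00 after 12 steps stops on the cost limit.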

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| `OPENROUTER_API_KEY` | Yes (OpenRouter models) | OpenRouter API key |
| `ANTHROPIC_API_KEY` | Yes (direct Anthropic) | Anthropic API key |
| `MSWEA_MODEL_NAME` | No | Fallback model if not set in YAML or CLI |
| `MSWEA_SILENT_STARTUP` | Auto-set | Suppresses mini-swe-agent startup banner (set automatically by `run_agent_eval.py`) |
| `MSWEA_COST_TRACKING` | No | Set to `ignore_errors` to silence cost tracking warnings |
| `MSWEA_DOCKER_EXECUTABLE` | No | Override docker binary path |

Copy .env.example to .env and fill in your keys. The project .env is loaded automatically before any model or agent code runs.

Extending

Add a new repo/platform: create a file under mobiledev_bench/harness/repos/ and, in it, define the Image subclasses and an Instance subclass decorated with @Instance.register(org, repo). Add a YAML config under minisweagent/config/benchmarks/ if the platform needs different prompts or timeouts.

Use a different model: either pass --model <string> at the CLI, or edit model_name in the relevant platform YAML.

Customise agent behaviour: subclass DefaultAgent in minisweagent/agents/, then instantiate your subclass in run_agent_eval.py in place of DefaultAgent.
