MobileDev-Bench 📱

A benchmark for evaluating large language models on real-world mobile app issue-resolution tasks across Android, Flutter, and React Native. Tasks are validated by execution, and their patches are substantially more complex than those in existing software engineering benchmarks.

Requirements

Python 3.10+
uv package manager
Docker (for running evaluation environments)

Installation

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/MobileDev-Bench/mobiledev-bench
cd mobiledev-bench

# Create virtual environment and install dependencies
uv sync

# Activate the environment
source .venv/bin/activate

Environment Variables

Variable	Description
`OPENROUTER_API_KEY`	OpenRouter API key (listed in the sample .env file below)
`GITHUB_TOKENS`	GitHub API tokens for data collection (comma-separated)

Create a .env file in the project root:

OPENROUTER_API_KEY=<your_api_key_here>
GITHUB_TOKENS=<your_github_token_here>
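
As a quick sanity check before running the collection tools, the sketch below parses `.env`-style text and splits `GITHUB_TOKENS` on commas. It is only an illustration: the helpers `parse_env` and `split_tokens` are hypothetical and not part of `mobiledev_bench`, the sample values are placeholders, and the project itself may load the file differently.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def split_tokens(raw: str) -> list[str]:
    """Split a comma-separated token string, dropping empty entries."""
    return [token.strip() for token in raw.split(",") if token.strip()]


# Placeholder content standing in for a real .env file (hypothetical values).
sample = "OPENROUTER_API_KEY=sk-example\nGITHUB_TOKENS=token_a, token_b\n"
env = parse_env(sample)
tokens = split_tokens(env.get("GITHUB_TOKENS", ""))
print(len(tokens), "GitHub token(s) configured")
```

To check a real file, replace `sample` with `pathlib.Path(".env").read_text()`.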

Pulling Docker Images

Pre-built Docker images for each benchmark instance are hosted on GitHub Container Registry (GHCR). Pull them with:

./scripts/pull_images_from_ghcr.sh mobiledev-bench

The script reads image names from scripts/images.txt.
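
For readers who want to see roughly what such a pull loop involves, here is a minimal sketch. It rests on assumptions not confirmed by this README: that scripts/images.txt holds one image name per line, and that each image is addressed as `ghcr.io/<organisation>/<name>`. The function `build_pull_commands` and the sample image names are hypothetical; the actual script may behave differently.

```python
def build_pull_commands(images_txt: str, org: str) -> list[list[str]]:
    """Turn an image list into `docker pull` argument lists.

    Assumes one image name per line (an assumption, not confirmed by the
    README); blank lines and lines starting with # are skipped.
    """
    commands: list[list[str]] = []
    for line in images_txt.splitlines():
        name = line.strip()
        if not name or name.startswith("#"):
            continue
        # Assumed reference layout: ghcr.io/<organisation>/<image name>
        commands.append(["docker", "pull", f"ghcr.io/{org}/{name}"])
    return commands


# Hypothetical image names, for illustration only.
sample = "example-image-1:latest\n\nexample-image-2:latest\n"
commands = build_pull_commands(sample, "mobiledev-bench")
for cmd in commands:
    print(" ".join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` to perform the pull.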

Modules

`mobiledev_bench/collect/`

Tools for collecting pull request data from GitHub repositories and building task instances.

# Collect GitHub PR data and build task instances
python -m mobiledev_bench.collect.collect_github_data \
    --repos owner/repo \
    --path_prs data/prs \
    --path_tasks data/tasks

Requires a GITHUB_TOKENS environment variable (set in a .env file or exported).

`mobiledev_bench/harness/`

Evaluation harness that runs model predictions against task instances inside Docker containers.

Parameters:

Parameter	Description	Default
`--mode`	Run mode: `evaluation`, `instance`, `instance_only`, `image`	`evaluation`
`--patch_files`	Path(s) to model prediction patch files (glob patterns supported)	—
`--dataset_files`	Path(s) to dataset `.jsonl` files (glob patterns supported)	—
`--workdir`	Working directory for intermediate build artifacts	—
`--output_dir`	Directory to write evaluation results	—
`--log_dir`	Directory to write logs	—
`--repo_dir`	Local directory for cloning repositories (not needed with remote images)	—
`--use_remote_images`	Pull Docker images from GHCR instead of building locally	`false`
`--ghcr_username`	GHCR organisation to pull images from	`mobiledev-bench`
`--max_workers_run_instance`	Parallel workers for running instances	`8`
`--stop_on_error`	Abort on first failure	`true`
`--specifics`	Run only specific instance IDs	—
`--skips`	Skip specific instance IDs	—
`--log_level`	Logging verbosity: `DEBUG`, `INFO`, `WARNING`, `ERROR`	`INFO`

Example:

python -m mobiledev_bench.harness.run_evaluation \
    --mode evaluation \
    --use_remote_images true \
    --ghcr_username mobiledev-bench \
    --patch_files /path/to/patches.jsonl \
    --dataset_files /path/to/dataset.jsonl \
    --workdir /path/to/workdir \
    --repo_dir /tmp \
    --output_dir /path/to/output \
    --log_dir /path/to/logs \
    --max_workers_run_instance 4 \
    --stop_on_error false \
    --log_level INFO
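
When launching the harness from a script, the same invocation can be assembled programmatically. The sketch below builds the argument list from a dictionary, using only flags from the parameter table above. The helper `build_harness_command` is hypothetical and not part of `mobiledev_bench`. Booleans are rendered as `true`/`false` to match the example command.

```python
import shlex
import sys


def build_harness_command(options: dict[str, object]) -> list[str]:
    """Build the argv list for the evaluation harness from flag/value pairs."""
    cmd = [sys.executable, "-m", "mobiledev_bench.harness.run_evaluation"]
    for flag, value in options.items():
        if isinstance(value, bool):
            # The example command passes booleans as lowercase strings.
            value = "true" if value else "false"
        cmd += [f"--{flag}", str(value)]
    return cmd


options = {
    "mode": "evaluation",
    "use_remote_images": True,
    "patch_files": "/path/to/patches.jsonl",
    "dataset_files": "/path/to/dataset.jsonl",
    "workdir": "/path/to/workdir",
    "output_dir": "/path/to/output",
    "log_dir": "/path/to/logs",
    "max_workers_run_instance": 4,
    "stop_on_error": False,
}
cmd = build_harness_command(options)
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would start the run. Building a list rather than a single shell string avoids quoting problems with paths that contain spaces.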

Citation

If you find MobileDev-Bench useful for your research, please cite our paper:

to be updated
