hypotest

Installation

Create a virtual environment and install the dependencies with uv:

uv venv
uv sync

The dataset is available on HuggingFace: EdisonScientific/bixbench_hypothesis.

You'll also need the capsule data in a directory on your local filesystem (see Downloading Capsule Data).

Downloading Capsule Data

The task capsule data is hosted on a public HuggingFace bucket:

hf sync hf://buckets/EdisonScientific/bixbench-hypothesis-capsules /path/to/capsules/
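
Before starting the server, it can help to confirm that the sync actually put files where you expect. The following is a minimal sketch, not part of hypotest: `capsule_dir_ready` is a hypothetical helper, and it only checks that the directory exists and is non-empty, since the layout of the capsule data is not described here.

```python
from pathlib import Path


def capsule_dir_ready(capsule_dir: str) -> bool:
    """Return True if capsule_dir exists and contains at least one entry.

    Hypothetical helper: it does not inspect the capsule contents.
    """
    path = Path(capsule_dir)
    if not path.is_dir():
        return False
    # any() stops at the first entry, so large directories are not fully listed.
    return any(path.iterdir())
```

For example, `capsule_dir_ready("/path/to/capsules/")` should return `True` once the download has finished.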

Running the Dataset Server

Create a server.yaml config file:

dataset:
  hf_dataset: EdisonScientific/bixbench_hypothesis
  capsule_dir: /path/to/capsules/
  save_dir: /path/to/outputs/ # optional, for saving rollout artifacts

api_key: YOUR_API_KEY # or the name of an environment variable, e.g. HYPOTEST_SERVER_API_KEY

Alternatively, you can point to a local JSONL file instead of the HuggingFace dataset:

dataset:
  problem_jsonl: /path/to/tasks.jsonl
  capsule_dir: /path/to/capsules/

api_key: YOUR_API_KEY
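
The two config variants above differ only in where the tasks come from. A small pre-flight check on the parsed config could look like the sketch below. `check_server_config` is a hypothetical helper, not a hypotest function, and it encodes assumptions drawn from the examples: that exactly one of `hf_dataset` or `problem_jsonl` should be set, and that `capsule_dir` and `api_key` are required.

```python
def check_server_config(config: dict) -> list[str]:
    """Return a list of problems found in a parsed server.yaml mapping.

    Assumptions (not confirmed by hypotest): one dataset source is expected,
    and capsule_dir and api_key are mandatory. save_dir is left optional.
    """
    problems = []
    dataset = config.get("dataset") or {}

    # Collect whichever dataset sources are present and non-empty.
    sources = [key for key in ("hf_dataset", "problem_jsonl") if dataset.get(key)]
    if len(sources) != 1:
        problems.append("dataset needs exactly one of hf_dataset or problem_jsonl")

    if not dataset.get("capsule_dir"):
        problems.append("dataset.capsule_dir is missing")
    if not config.get("api_key"):
        problems.append("api_key is missing")
    return problems
```

If PyYAML is installed, the mapping can be obtained with `yaml.safe_load` on the contents of server.yaml. An empty list means the sketch found nothing to complain about.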

Start the server:

make server CONFIG=server.yaml

Running Benchmarks

Create a benchmark.yaml config file:

results_dir: benchmark_results/

api_key: YOUR_API_KEY # must match server api_key

agent_config:
  agent_kwargs:
    llm_model:
      name: openai/gpt-5
      temperature: 1.0
      timeout: 600
      config:
        model_list:
          - model_name: openai/gpt-5
            litellm_params:
              model: openai/gpt-5
              timeout: 600
              temperature: 1.0
              reasoning_effort: medium
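
The config above must carry the same `api_key` as the server, and it repeats `timeout` and `temperature` under both `llm_model` and `litellm_params`. The sketch below flags mismatches before a run. Both functions are hypothetical helpers, not part of hypotest. Two assumptions are baked in: an `api_key` value is treated as an environment variable name when such a variable is set (how hypotest itself resolves it is not stated here), and the repeated fields are assumed to be meant to agree.

```python
import os


def resolve_api_key(value: str) -> str:
    """Assumption: use the env var named `value` if it is set, else `value` itself."""
    return os.environ.get(value, value)


def check_benchmark_config(benchmark: dict, server: dict) -> list[str]:
    """Return a list of inconsistencies between benchmark.yaml and server.yaml mappings."""
    problems = []

    bench_key = resolve_api_key(benchmark.get("api_key", ""))
    server_key = resolve_api_key(server.get("api_key", ""))
    if bench_key != server_key:
        problems.append("api_key does not match the server api_key")

    llm = benchmark.get("agent_config", {}).get("agent_kwargs", {}).get("llm_model", {})
    for entry in llm.get("config", {}).get("model_list", []):
        params = entry.get("litellm_params", {})
        # Only compare fields that appear at both levels.
        for field in ("timeout", "temperature"):
            if field in llm and field in params and llm[field] != params[field]:
                problems.append(
                    f"{field} differs between llm_model and {entry.get('model_name')}"
                )
    return problems
```

An empty result only means the two files agree with each other under these assumptions; it does not confirm that the server accepts the key.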

Run the benchmark:

uv run python src/hypotest/benchmark_agent.py benchmark.yaml
