vitillo/rolo

rolo

A minimal RL framework for LLMs.

RL fine-tuning of language models follows a straightforward loop: load problems, roll out conversations, score the results, compute advantages, and update the model weights. Each of these steps maps to a Python protocol — DataLoader, Env, Generator, RewardModel, Model — with minimal coupling between them. The two included examples (GSM8K math and MuSiQue multi-hop search) are self-contained single-file programs that wire these pieces together.

The loop

DataLoader  →  Env         →  Generator       →  RewardModel  →  GRPOTrainer   →  Model.update()
(problems)     (rollouts)     (trajectories)     (scores)        (advantages)     (weight update)

One GRPO step:

  1. DataLoader yields a batch of examples.
  2. Generator builds an Env per example, samples from the Policy, and collects Trajectory groups.
  3. RewardModel scores each trajectory.
  4. GRPOTrainer mean-centers rewards within each group (the GRPO baseline) and calls Model.update().
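The baseline in step 4 can be sketched in a few lines. This is only an illustration of group mean-centering, not rolo's compute_advantages (its signature is not shown in this README); the name group_advantages and the plain-list inputs are assumptions made for the example.

```python
from statistics import fmean


def group_advantages(rewards_per_group: list[list[float]]) -> list[list[float]]:
    """Subtract each group's mean reward from every reward in that group.

    Hypothetical helper for illustration; not part of rolo's API.
    Each inner list holds the rewards of all samples drawn for one prompt.
    """
    advantages = []
    for rewards in rewards_per_group:
        baseline = fmean(rewards)  # the group mean acts as the baseline
        advantages.append([r - baseline for r in rewards])
    return advantages


# Two prompts, four sampled completions each (1.0 = correct, 0.0 = wrong).
advantages = group_advantages([[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
print(advantages)  # [[0.75, -0.25, -0.25, -0.25], [0.0, 0.0, 0.0, 0.0]]
```

In this sketch, a group whose samples all receive the same reward ends up with zero advantages, so it contributes nothing to the update.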

Three layers

The framework keeps three layers separate, each with its own types:

  • Messages: Message, ToolCall. The chat protocol.
  • RL: State, Action, Env, Step, Trajectory, Reward. Environments and rollouts.
  • Tokens: prompt ids, completion ids, logprobs. The Renderer bridges messages and tokens.

The Renderer is the seam. Envs work with messages. Training works with token ids. The renderer translates between the two so envs stay model-agnostic.
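To make the seam concrete, here is a toy renderer that maps messages to token ids and back. Everything in it is an assumption for illustration: the ToyRenderer class, its render and parse methods, the word-level vocabulary, and the simplified Message type are not rolo's Renderer or the Tinker adapter, which would use a real tokenizer and chat template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Simplified stand-in for a chat message (role plus text)."""
    role: str
    content: str


class ToyRenderer:
    """Hypothetical renderer: one token id per whitespace-separated word."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}
        self._words: list[str] = []

    def _id(self, word: str) -> int:
        # Assign ids in first-seen order so the mapping is reversible.
        if word not in self._ids:
            self._ids[word] = len(self._words)
            self._words.append(word)
        return self._ids[word]

    def render(self, messages: list[Message]) -> list[int]:
        """Messages -> prompt token ids, with a role tag before each message."""
        ids: list[int] = []
        for m in messages:
            ids.append(self._id(f"<{m.role}>"))
            ids.extend(self._id(w) for w in m.content.split())
        ids.append(self._id("<assistant>"))  # cue the model to respond
        return ids

    def parse(self, completion_ids: list[int]) -> Message:
        """Completion token ids -> assistant message."""
        text = " ".join(self._words[i] for i in completion_ids)
        return Message("assistant", text)


renderer = ToyRenderer()
prompt_ids = renderer.render([Message("user", "What is 2 + 2 ?")])
reply = renderer.parse([prompt_ids[3], prompt_ids[4], prompt_ids[3]])
print(prompt_ids)  # [0, 1, 2, 3, 4, 3, 5, 6]
print(reply)       # Message(role='assistant', content='2 + 2')
```

The env side only ever sees Message objects and the training side only ever sees the integer lists, which is the separation the paragraph above describes.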

Protocols

Everything pluggable is a Python Protocol — no base classes, no registration:

Protocol      Does
Env           start() → State, step(Action) → State | None
EnvBuilder    build(example) → Env
Renderer      messages ↔ tokens
RewardModel   score(example, Trajectory) → Reward
Policy        sample(prompt) → CompletionOutput
Model         policy() → Policy, update(batch) → UpdateResult
Generator     generate(examples, ...) → list[list[Trajectory]]
DataLoader    batches() → Iterator[list[Example]]
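The "no base classes, no registration" point is what typing.Protocol provides: any class with matching methods satisfies the protocol. The sketch below uses a deliberately simplified RewardModel (a dict example, a string trajectory, a float reward) rather than rolo's real types, and ExactMatchReward and mean_reward are hypothetical names.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class RewardModel(Protocol):
    # Simplified stand-in: the table lists score(example, Trajectory) -> Reward;
    # here the example is a dict, the trajectory a str, and the reward a float.
    def score(self, example: dict, trajectory: str) -> float: ...


class ExactMatchReward:
    """Satisfies RewardModel structurally: no inheritance, no registration."""

    def score(self, example: dict, trajectory: str) -> float:
        return 1.0 if trajectory.strip() == example["answer"] else 0.0


def mean_reward(model: RewardModel, example: dict, trajectories: list[str]) -> float:
    """Average score over a group of sampled trajectories."""
    return sum(model.score(example, t) for t in trajectories) / len(trajectories)


reward_model = ExactMatchReward()
# runtime_checkable isinstance only checks that the method exists, not its types.
is_reward_model = isinstance(reward_model, RewardModel)
average = mean_reward(reward_model, {"answer": "4"}, ["4", " 4 ", "5", "four"])
print(is_reward_model, average)  # True 0.5
```

Swapping in a different grader then means writing another class with a score method; nothing else has to change.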

Package layout

rolo/
  message.py          # Message, ToolCall
  rl.py               # State, Action, Env, Step, Trajectory, Reward
  rendering.py        # Renderer, RenderedPrompt, ToolSpec
  generation.py       # Policy, Generator, RolloutGenerator
  rewards.py          # RewardModel
  model.py            # Model, TrainingBatch
  training.py         # GRPOTrainer, GRPOConfig, compute_advantages
  data.py             # DataLoader, HuggingFaceDataLoader
  logging.py          # MetricsLogger, TensorBoardLogger
  tinker_backend.py   # Tinker model + renderer adapters
  examples/
    gsm8k.py          # Single-turn math (GSM8K)
    musique_search.py # Multi-turn search agent (MuSiQue)

Examples

GSM8K — single-turn math

A one-shot prompt, a boxed-answer format, and math-verify for semantic grading. The simplest possible GRPO setup.

uv run python -m rolo.examples.gsm8k \
  --model-name meta-llama/Llama-3.2-1B \
  --batch-size 64 \
  --samples-per-prompt 8 \
  --max-steps 100 \
  --eval-on-start \
  --eval-every-steps 10 \
  --eval-limit 100 \
  --run-dir runs/gsm8k \
  --save-best-checkpoint

Results from a 100-step run on Llama-3.2-1B (base, not instruct):

Step  Eval accuracy  Pass@8
0     0.4%           3%
40    5.4%           26%
80    8.5%           37%

Format compliance (\boxed{}) saturates within the first few steps thanks to the one-shot example. By step 80, Pass@8 (at least 1 of 8 samples correct) reaches 37%, which suggests the model has latent capability on over a third of the eval problems even though per-sample accuracy is only 8.5%.
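The gap between the two metrics follows from how they are counted. A minimal sketch, assuming per-sample correctness is available as nested lists of booleans; the function names and the grades below are made up for illustration and are not taken from the run or from rolo's evaluation code.

```python
def pass_at_k(correct_per_problem: list[list[bool]]) -> float:
    """Fraction of problems where at least one of the k samples is correct."""
    solved = sum(any(samples) for samples in correct_per_problem)
    return solved / len(correct_per_problem)


def per_sample_accuracy(correct_per_problem: list[list[bool]]) -> float:
    """Fraction of all individual samples that are correct."""
    flat = [c for samples in correct_per_problem for c in samples]
    return sum(flat) / len(flat)


# Made-up grades for 4 problems x 8 samples (not data from the run above).
grades = [
    [False] * 8,
    [True] + [False] * 7,
    [False] * 8,
    [False, True, True] + [False] * 5,
]
print(pass_at_k(grades))            # 0.5
print(per_sample_accuracy(grades))  # 0.09375
```

Half of the toy problems are solved at least once even though fewer than one sample in ten is correct, the same pattern as in the table.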

MuSiQue — multi-turn search agent

A local per-example knowledge base with one structured search tool, BM25 retrieval, and a reward model that scores answer correctness while penalizing tool-call loops. This is the smallest useful multi-turn agent task in the repo.
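One way such a reward could be shaped is sketched below. It is not the reward model in musique_search.py: the ToyTrajectory type, exact-match grading, the definition of a loop as a repeated query, and the 0.25 penalty are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTrajectory:
    """Hypothetical stand-in for a rollout: search queries issued, then an answer."""
    queries: list[str] = field(default_factory=list)
    answer: str = ""


def search_reward(gold: str, traj: ToyTrajectory, loop_penalty: float = 0.25) -> float:
    """Exact-match correctness minus a penalty for each repeated query.

    The 0.25 penalty and exact-match grading are illustrative assumptions.
    """
    correct = 1.0 if traj.answer.strip().lower() == gold.strip().lower() else 0.0
    normalized = [q.strip().lower() for q in traj.queries]
    repeats = len(normalized) - len(set(normalized))  # queries already asked before
    return correct - loop_penalty * repeats


clean = ToyTrajectory(["who wrote the novel", "author birthplace"], "Dublin")
looping = ToyTrajectory(["author birthplace", "Author birthplace"], "Dublin")
print(search_reward("Dublin", clean))    # 1.0
print(search_reward("Dublin", looping))  # 0.75
```

Under mean-centering within a group, the looping trajectory would then rank below the clean one even though both reach the right answer.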

Requires a model with native tool-calling support (e.g. Qwen 3.5):

uv run python -m rolo.examples.musique_search \
  --model-name Qwen/Qwen3.5-4B \
  --batch-size 4 \
  --samples-per-prompt 2 \
  --max-turns 2 \
  --learning-rate 8e-5 \
  --max-tokens 256 \
  --train-limit 32 \
  --eval-limit 16 \
  --run-dir runs/musique_search

Tinker backend

The concrete Model and Renderer implementations use the Tinker service for remote LoRA training and sampling. Local training is not yet implemented.

  • TinkerModelConfig.project_id must be an existing Tinker project id if set.
  • TinkerRenderer uses tinker_cookbook tokenizer resolution and may fetch tokenizer files from the Hub on first use.
