GRRM: Group Relative Reward Modeling for Machine Translation

Official implementation of the paper: GRRM: Group Relative Reward Modeling for Machine Translation.

This repository contains the code for GRRM (Group Relative Reward Model), a Generative Reward Model** instantiated under the proposed **Group Quality Metric (GQM) paradigm for reinforcement learning-based machine translation (MT) optimization with GRPO.

The RL training pipeline is powered by our verl with Generative Reward Models support fork, designed to be generic and reusable for any GenRM-based RL training.

Highlights

Improved reward accuracy: large gains especially on challenging samples (idioms, slang, terminology) and restores reward variance needed by GRPO advantages. Also Robust to reward hacking.
GRRM: efficient GQM-based reward model trained with SFT + RLVR.
Cross-lingual generalization: GRRM trained on Zh–En can support multilingual MT optimization.
MT optimization: GRPO training with GRRM yields strong improvements on WMT-style benchmarks and challenge sets.

Ranking accuracy performance and downstream translation performance on Seed-X-Challenge:

Method Overview

Core idea: Standard generative reward models often evaluate candidates independently (Scalar Quality Metric, SQM) and suffer from score saturation, which causes vanishing advantages in GRPO and stalls optimization.
We propose GQM, which evaluates a group of candidates jointly to produce reliable fine-grained intra-group ranking, and implement it as GRRM, a high-throughput reward model with explicit comparative reasoning.

GQM evaluates the entire group together and outputs:

comparative analysis
predicted ranking
scores consistent with that ranking

GRRM Example:

[prompt]
Given a source text in English and multiple translation candidates in Chinese. Perform a step by step analysis and comparison of the translation quality for the candidates. Finally, rank and score the 
candidates with integer scores on a scale from 0 to 10.

Source text:
```
I have a frog in my throat.
```

Translation A:
```
我嗓子眼里好像有只青蛙。,
```
Translation B:
```
我嗓子有点哑。
```

[response]
An analysis and comparison of the translation candidates:

**Source Text:** "I have a frog in my throat."
This is an English idiomatic expression. Literally, it doesn't mean a frog is physically stuck in one's throat. It means the person has a tickle, lump, or irritation in their throat that affects their voice, often making them sound hoarse. It is used to describe a temporary condition.

**Translation A: 我嗓子有点哑。**
*   **Faithfulness:** This translation is not literal but captures the *consequence* or *function* of the source idiom. "我嗓子有点哑" means "My voice is a little hoarse." This is a very common and natural way in Chinese to express the same situation as "I have a frog in my throat." It correctly conveys the practical meaning in this context.
*   **Fluency:** This is a perfectly fluent and idiomatic Chinese expression that a native speaker would use in daily conversation.
*   **Overall:** A good translation that prioritizes communicative effectiveness over literalism. It is highly acceptable.

**Translation B: 我嗓子眼里好像有只青蛙。**
*   **Faithfulness:** This translation is very literal. "我嗓子眼里" means "in my throat," and "好像" ("as if" or "seemingly") is an attempt to translate the metaphorical nature of the source. The main problem is "有只青蛙" (there is a frog). This is a direct, word-for-word physical translation that completely misses the idiomatic meaning. It explains the phrase as if one was literally trying to describe a real frog in the throat.
*   **Fluency:** This sentence is grammatically correct, but it sounds highly unnatural and awkward in Chinese. While hypothetically it could be understood literally by some as a form of sick and disturbing joke, it is not used to express hoarseness in any common, standard way.
*   **Overall:** A poor translation because it fails to convey the correct idiomatic meaning and results in an unnatural, almost comical phrasing.

**Conclusion:**

*   **Translation A** is the superior choice. It effectively communicates the intended meaning in a natural, idiomatic way that is common in Chinese.
*   **Translation B** is a very literal, inaccurate, and awkward attempt that misses the point of the idiom and sounds incorrect.

### Final Ranking:

A > B

### Scores:

A: 8, B: 3

Performance Overview

1. Reward Model Ranking Accuracy

Ranking accuracy (%) on the held-out test set (Gemini-annotated) versus human-annotated benchmarks and the challenge set. GRRM achieves state-of-the-art ranking performance, especially on complex reasoning scenarios (Seed-X-Challenge).

Model	Paradigm	held-out (Zh↔En)	NT20 (Zh→En)	GenMT22 (En→Zh)	GenMT22 (Zh→En)	GenMT22 (En→De)	GenMT22 (En→Ru)	Seed-X-Challenge	Avg.
LLM-as-a-Judge
DeepSeek-R1-0528	SQM	66.11	48.42	43.67	58.09	53.64	59.13	47.22	51.69
DeepSeek-R1-0528	GQM	80.92	61.98	64.38	65.79	63.26	69.15	81.82	67.73
Discriminative RMs
CometKiwi-XXL	SQM	72.01	57.82	66.49	61.60	61.20	67.12	46.72	60.16
BT-RM	SQM	82.62	58.16	66.49	64.92	58.14	66.10	58.84	62.11
Generative RMs
SQM-GenRM (RLVR)	SQM	64.25	49.21	39.42	60.37	49.16	54.82	38.38	48.56
GRRM (RLVR)	GQM	82.58	57.77	62.17	66.00	61.04	66.67	70.39	64.01

2. Downstream Machine Translation Performance

MT performance on WMT and Seed-X-Challenge benchmarks. We report BLEURT-20 and LLM-as-a-Judge scores (evaluated by DeepSeek-R1-0528). Optimizing with GRRM via GRPO significantly improves the translation quality and reasoning capabilities of the base model.

Model	WMT Zh→En (BLEURT / R1)	WMT En→Zh (BLEURT / R1)	WMT En→X (BLEURT / R1)	Seed-X Zh→En (BLEURT / R1)	Seed-X En→Zh (BLEURT / R1)
General LLMs
Gemini-2.5-Pro	68.66 / 92.92	66.00 / 91.31	68.87 / 90.35	71.59 / 89.41	69.19 / 86.06
DeepSeek-R1-0528	67.78 / 92.34	64.87 / 89.24	67.72 / 88.48	70.92 / 87.95	68.23 / 84.40
Qwen2.5-7B-Instruct	67.31 / 88.49	59.92 / 80.51	58.72 / 72.51	66.59 / 79.23	62.75 / 72.37
Specialized Models
TowerInstruct-13B	67.56 / 84.83	62.92 / 77.63	66.61 / 82.68	63.32 / 69.54	63.46 / 71.17
SeedX-PPO	69.02 / 90.47	67.21 / 87.98	68.35 / 86.04	69.37 / 82.47	68.72 / 80.56
SSR-X-Zero-7B	68.30 / 88.67	66.12 / 83.78	- / -	68.84 / 81.15	67.08 / 77.56
Qwen2.5-7B-SFT	67.07 / 87.78	59.99 / 76.98	57.14 / 67.91	67.65 / 80.91	62.36 / 72.42
⭐+ GRPO	67.41 / 92.24	64.80 / 87.80	64.65 / 83.86	69.55 / 85.90	67.05 / 82.55
⭐+ GRPO w/ CLA	67.39 / 92.09	63.91 / 88.29	64.50 / 83.71	69.25 / 88.58	67.07 / 83.33

Getting Started

Installation

git clone https://github.com/NJUNLP/GRRM.git
cd GRRM
pip install -e . --no-build-isolation

Extra dependencies available:

infer: install vllm for inference.
eval: sacrebleu and bleurt for translation evaluation.

For model training, additional dependencies are required. We use Llama-Factory for SFT training and verl with Generative Reward Models support for reinforcement learning training.

Quick Use

1) Use GRRM to rank a group of translations (GQM inference)

This script performs Group Quality Metric (GQM) inference using vLLM to evaluate and rank multiple translation candidates. It includes prompt templates, result parsing and return the scores and raw model outputs.

Example usage:

import inference.run_rm_GQM as rm_GQM

output = rm_GQM.func_call(
    model_path="double7/GRRM-Qwen2.5-7B",
    src_list=["I have a frog in my throat."],
    mt_list=[["我嗓子有点哑。", "我嗓子眼里好像有只青蛙。"]],
    src_langs=["en"],
    trg_langs=["zh"],
    temperature=1.0,
    top_p=1.0,
    max_new_tokens=8192,
    retry=6,
    prompt_type="ranking_score",
)

# output["scores"] -> [[8, 3]]  # scores for each candidate
# output["responses"] -> ["...model response text..."]

Note

Inference at low temperature may fail. Set retry to automatically retry with higher temperature.
For GRRM, set prompt_type to ranking_score.

2) Use GRRM-Optimized MT model for translation inference

This script performs machine translation inference using vLLM. It supports multiple prompt formats and answer extraction methods for different model types. It returns a dictionary with translation responses and raw model outputs.

Example usage:

import inference.run_mt as mt

output = mt.func_call(
    model_path="double7/Qwen2.5-7B-MT-GRRM-Optimized",
    src_list=["The grass is always greener on the other side.", "INTJ总是装E"],
    src_langs=["en", "zh"],
    trg_langs=["zh", "en"],
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=8192,
    retry=6,
    prompt_type="codeblock-think",
    use_chat_template=True,
)

# output["responses"] -> ["这山望着那山高。", "INTJs are always putting on an extroverted front."]
# output["raw_outputs"] -> ["...raw model output...", "...raw model output..."]

Note

Inference at low temperature may fail. Set retry to automatically retry with higher temperature.
For GRRM-Optimized MT models, set prompt_type to codeblock-think.

Training Pipeline

Additional Codebase

Our reinforcement learning pipeline for GRRM is based on our fork: verl with Generative Reward Models support. It is developed as the training backend for GRRM, but the GenRM support is designed to be generic and reusable.

Detailed Training Recipes

For GRRM training, please refer to GRRM Training Pipeline.

For using GRRM for downstream machine translation GRPO training, please refer to MT Training Pipeline.

Evaluation

Ranking Accuracy Evaluation

Data Preparation:

Run the following commands to download the held-out testset with Gemini-annotated rankings:

hf download double7/TowerBlocks-MT-Ranking --repo-type dataset --local-dir parquet_data/TowerBlocks-MT-Ranking

Run the following commands to download the human-derived testset for ranking accuracy evaluation, along with the Seed-X-Challenge binary ranking task data:

hf download double7/MT_Ranking_Metric_Test --repo-type dataset --local-dir parquet_data/MT_Ranking_Metric_Test

Evaluation:

Run evaluation for Qwen2.5-7B-GRRM on all datasets:

python eval/run_ranking_acc_eval.py \
    --data_id tower_zhen_ranking_testset,wmt_newstest2020_psqm,wmt_generalMT2022_enzh_mqm,wmt_generalMT2022_zhen_mqm,wmt_generalMT2022_ende_mqm,wmt_generalMT2022_enru_mqm,seedx_challenge_ranking \
    --model_path path/to/Qwen2.5-7B-GRRM \
    --model_name Qwen2.5-7B-GRRM \ # model name for logging
    --temperature 0 \
    --top_p 1.0 \
    --max_new_tokens 8192 \
    --prompt_type ranking_score \
    --model_type grrm \
    --runs 4

Note

The mapping between data id and data path is defined in RANKING_TEST_DATA_META_INFO of utils/config.py.

Machine Translation Evaluation

Data Preparation:

We use existing public datasets for evaluation. For WMT23 test, refer to sacrebleu tool. For WMT24++, refer to google/wmt24pp. For Seed-X-Challenge set, refer to ByteDance-Seed/Seed-X-7B.

Data Format

Convert downloaded data to parquet format with the following required columns:

Column Name	Description	Required
`src_text`	Source text to translate	Yes
`trg_text`	Reference translation (ground truth)	Yes
`src_lang`	Source language code (e.g., "en", "zh", "de")	Yes
`trg_lang`	Target language code (e.g., "en", "zh", "de")	Yes
`comment`	Evaluation hints/translation focus points (optional)	No

Note

Seed-X-Challenge provides additional translation evaluation points. These should be stored in the comment column. When present, the comment will be concatenated with the reference translation and provided to LLM-based evaluation models (e.g., gpt-oss). BLEURT evaluation does not use these comments.

Dataset Configuration

Register your datasets in utils/config.py by adding entries to MT_TEST_DATA_META_INFO:

Evaluation:

Run evaluation for Qwen2.5-7B-MT-GRRM-Optimized on Zh<->En data:

python eval/run_mt_eval.py \
    --data_id seedx_challenge_zhen,seedx_challenge_enzh,wmt24pp_en_zh,wmt23_zh_en \
    --model_path path/to/Qwen2.5-7B-MT-GRRM-Optimized \
    --model_name Qwen2.5-7B-MT-GRRM-Optimized \
    --temperature 0 \
    --top_p 1 \
    --max_new_tokens 8192 \
    --metrics '["bleurt","oss"]' \
    --prompt_type codeblock-think \
    --runs 4

Citation

@article{yang2026grrmgrouprelativereward,
      title={GRRM: Group Relative Reward Modeling for Machine Translation}, 
      author={Sen Yang and Shanbo Cheng and Lu Xu and Jianbing Zhang and Shujian Huang},
      year={2026},
      eprint={2602.14028},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.14028},
}

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
data		data
eval		eval
inference		inference
recipes		recipes
scripts		scripts
static		static
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
VERSION		VERSION
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GRRM: Group Relative Reward Modeling for Machine Translation

Highlights

Method Overview

Performance Overview

1. Reward Model Ranking Accuracy

2. Downstream Machine Translation Performance

Getting Started

Installation

Quick Use

Training Pipeline

Additional Codebase

Detailed Training Recipes

Evaluation

Ranking Accuracy Evaluation

Machine Translation Evaluation

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GRRM: Group Relative Reward Modeling for Machine Translation

Highlights

Method Overview

Performance Overview

1. Reward Model Ranking Accuracy

2. Downstream Machine Translation Performance

Getting Started

Installation

Quick Use

Training Pipeline

Additional Codebase

Detailed Training Recipes

Evaluation

Ranking Accuracy Evaluation

Machine Translation Evaluation

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages