
consolidate mbridge distillation: merge distill_hf.py into distill.py#1220

Open
j-rausch wants to merge 4 commits into feature/puzzletron from jrausch/distillation-consolidation

Conversation

Contributor

@j-rausch j-rausch commented Apr 9, 2026

Summary

  • Unified examples/puzzletron/mbridge_distillation/distill_hf.py (AnyModel-specific) into examples/megatron_bridge/distill.py (general)
    • The single script now handles both standard HF and Puzzletron AnyModel checkpoints.
  • Added --hf_export_path / --student_hf_model args for inline HF export after distillation.
  • Merged AnyModel integration test into tests/examples/megatron_bridge/test_distill.py
    • test models use vocab_size=128 (instead of the default 102) so the vocabulary divides evenly across tensor-parallel sizes up to 8.
  • Moved MMLU distillation results into megatron_bridge/README.md
    • puzzletron README now redirects to the consolidated docs.
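The vocab bump can be sanity-checked directly: a tensor-parallel embedding table is sharded across ranks, so the vocabulary size must divide evenly by the TP size, which 128 does for every size up to 8 while the old default of 102 does not.

```python
# Sanity check for the vocab_size change: the embedding table is sharded
# across tensor-parallel ranks, so vocab_size must be divisible by the TP size.
old_vocab, new_vocab = 102, 128

for tp in (1, 2, 4, 8):
    assert new_vocab % tp == 0, f"vocab_size={new_vocab} not divisible by TP={tp}"

# The old default only supports TP=1 and TP=2:
bad_tps = [tp for tp in (1, 2, 4, 8) if old_vocab % tp != 0]
print(bad_tps)  # → [4, 8]
```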

Limitation discovered during consolidation:
HF export via --hf_export_path currently does not work for Puzzletron AnyModel (heterogeneous) checkpoints. Megatron-Bridge's export_ckpt cannot reload heterogeneous model configs from saved checkpoints (heterogeneous_layers_config_encoded_json is None during __post_init__ in heterogeneous_config.py). This affects both the inline --hf_export_path flow and the separate convert_checkpoints.py export script.

The original distill_hf.py README documented this as supported, but I think it might have been broken there too (on the Megatron-Bridge side). The consolidated README now documents this as a known limitation. HF export for standard models works fine via both methods.

Summary by CodeRabbit

  • New Features

    • Optional export of distilled checkpoints to HuggingFace format at distillation completion; distillation now accepts both standard HuggingFace and Puzzletron AnyModel checkpoints (HF export unsupported for heterogeneous AnyModel checkpoints).
  • Documentation

    • Consolidated distillation guide with two export workflows, clarified output locations, MMLU comparison tables (including a regression) and recommendations; removed the older Puzzletron-specific guide.
  • Tests

    • Added integration test for Puzzletron AnyModel distillation; removed HF-only distillation test and the HF-only distillation script.
  • Chores

    • Example test matrix updated to run only Megatron Bridge examples; test tokenizer fixtures extended with additional special tokens.

…still.py

Signed-off-by: jrausch <jrausch@nvidia.com>
Signed-off-by: root <root@pool0-00848.cm.cluster>
@j-rausch j-rausch requested review from a team as code owners April 9, 2026 12:29
@j-rausch j-rausch requested review from ChenhanYu and jenchen13 and removed request for a team April 9, 2026 12:29

copy-pr-bot bot commented Apr 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

coderabbitai bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 84089e08-12d5-41e3-ba6b-01ba0aa24b2c

📥 Commits

Reviewing files that changed from the base of the PR and between 3c0fa5a and 0c2a2ee.

📒 Files selected for processing (1)
  • tests/_test_utils/torch/puzzletron/utils.py
✅ Files skipped from review due to trivial changes (1)
  • tests/_test_utils/torch/puzzletron/utils.py

📝 Walkthrough

Walkthrough

Consolidates Megatron-Bridge distillation into examples/megatron_bridge/distill.py, adds optional HuggingFace export flags and guarded Puzzletron export import, removes the standalone Puzzletron mbridge distillation script and its test, updates docs/tests for Puzzletron AnyModel support, and expands test tokenizer vocab.

Changes

  • **Docs — megatron_bridge** (`examples/megatron_bridge/README.md`): Revamped distillation docs: state that distill.py accepts HF and Puzzletron AnyModel checkpoints; document the inline HF export flags (--hf_export_path, --student_hf_model) and the alternative convert_checkpoints.py export workflow; note the HF export limitation for heterogeneous AnyModel checkpoints; add MMLU results and recommendations.
  • **Distillation script** (`examples/megatron_bridge/distill.py`): Added CLI args --hf_export_path and --student_hf_model (with validation when exporting); added a guarded import of the Puzzletron export module; extended the end-of-distillation flow to call dist.cleanup() across ranks and, on rank 0 when exporting, call AutoBridge.export_ckpt and copy the student config.json to the HF export dir.
  • **Removed Puzzletron standalone** (`examples/puzzletron/mbridge_distillation/README.md`, `examples/puzzletron/mbridge_distillation/distill_hf.py`): Deleted the standalone Puzzletron mbridge distillation README and distill_hf.py (functionality consolidated into megatron_bridge).
  • **Docs — puzzletron** (`examples/puzzletron/README.md`): Updated the link to the megatron_bridge distillation docs and noted support for HF and Puzzletron AnyModel checkpoints.
  • **Tests — megatron_bridge** (`tests/examples/megatron_bridge/test_distill.py`): Added helpers/imports to create tiny HF models and convert them to AnyModel; added test_distill_puzzletron_anymodel; adjusted the existing test to use pp_size=1; asserted expected checkpoint artifacts.
  • **Removed tests — puzzletron** (`tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py`): Removed the integration test for the deleted standalone distillation script; coverage moved to the megatron_bridge tests.
  • **Test utilities — formatting** (`tests/_test_utils/torch/puzzletron/utils.py`): Docstring reformatting for create_and_save_small_hf_model only (no behavior change).
  • **Test tokenizer** (`tests/_test_utils/torch/tokenizer/tokenizer.json`): Expanded BPE vocab with additional special tokens `<
  • **CI matrix** (`.github/workflows/example_tests.yml`): Removed puzzletron from the nemo-pr example test matrix so PR runs cover only megatron_bridge.
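The "guarded import" mentioned for distill.py keeps the script usable for standard HF checkpoints even without the Puzzletron extras installed. A minimal sketch of the pattern (the module path and function names here are illustrative, not the repo's actual imports):

```python
# Guarded optional import: the script should still run when the optional
# Puzzletron export module is unavailable, and only fail if it is requested.
try:
    from puzzletron.export import export_anymodel  # illustrative module path
    HAS_PUZZLETRON_EXPORT = True
except ImportError:
    export_anymodel = None
    HAS_PUZZLETRON_EXPORT = False

def require_puzzletron_export():
    """Fail loudly only when the user actually asks for the optional feature."""
    if not HAS_PUZZLETRON_EXPORT:
        raise RuntimeError("Puzzletron export requested but the module is not installed")
    return export_anymodel
```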

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Launcher as Torchrun Launcher
    participant Worker as Distillation Worker (per-rank)
    participant Dist as Distributed Backend
    participant Exporter as AutoBridge Export
    participant FS as Filesystem

    Launcher->>Worker: start per-rank distill.py
    Worker->>Dist: dist.setup()
    Worker->>Worker: run distillation training loop
    Worker->>Dist: request checkpoint save (per-iteration)
    Worker->>Dist: dist.cleanup() (all ranks)
    Note right of Worker: If --hf_export_path set
    Worker->>Worker: is_rank_0? (rank check)
    alt rank_0
        Worker->>Exporter: AutoBridge.export_ckpt(megatron_checkpoint, hf_path, model_name)
        Exporter->>FS: write HF model files
        Worker->>FS: copy student config.json -> hf_path
    end
```
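The end-of-run control flow in the diagram can be sketched framework-free; dist.cleanup() and AutoBridge.export_ckpt are stood in for by injected callables, so this illustrates the control flow only, not the script's actual code:

```python
import os

def finalize_distillation(rank, hf_export_path, cleanup_fn, export_fn):
    """All ranks tear down the distributed backend; only rank 0 performs
    the optional HF export (export_fn stands in for AutoBridge.export_ckpt)."""
    cleanup_fn()  # dist.cleanup() runs on every rank, export or not
    if hf_export_path is not None and rank == 0:
        os.makedirs(hf_export_path, exist_ok=True)
        export_fn(hf_export_path)  # write HF model files, copy config.json

# Illustrative dry run with stubs recording the call order:
calls = []
finalize_distillation(
    rank=0,
    hf_export_path="/tmp/hf_export_demo",
    cleanup_fn=lambda: calls.append("cleanup"),
    export_fn=lambda p: calls.append(("export", p)),
)
print(calls)  # → ['cleanup', ('export', '/tmp/hf_export_demo')]
```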

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped -- CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly and concisely summarizes the main objective of the pull request: consolidating the AnyModel-specific distill_hf.py script into the general distill.py script. |
| Security Anti-Patterns | ✅ Passed | Pull request introduces no new security anti-patterns. distill.py correctly implements trust_remote_code as a configurable CLI parameter. No unsafe torch.load, numpy.load, hardcoded trust_remote_code, eval/exec, nosec comments, or non-permissive dependencies introduced. |


@kevalmorabia97 kevalmorabia97 requested review from AAnoosheh, danielkorzekwa and kevalmorabia97 and removed request for ChenhanYu and jenchen13 April 9, 2026 12:30
Contributor

github-actions bot commented Apr 9, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1220/

Built to branch gh-pages at 2026-04-10 23:44 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/megatron_bridge/distill.py`:
- Line 304: The shutil.copy call that copies config.json from args.student_hf_path will fail when args.student_hf_path is a remote HuggingFace model ID. Update the logic around the shutil.copy usage in distill.py to detect whether args.student_hf_path is a local path; if it is a remote model ID, fetch the config via the HuggingFace API (e.g. hf_hub_download, or transformers.AutoConfig.from_pretrained followed by save_pretrained) and write it to args.hf_export_path/config.json; otherwise keep the local shutil.copy behavior. This way the code handles both local paths and remote model IDs.
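One way to implement the suggested fix, assuming only that the student path is either a local checkpoint directory or a Hub repo ID (the function name and the hf_hub_download fallback are a sketch, not the PR's code):

```python
import os
import shutil

def copy_student_config(student_hf_path: str, hf_export_path: str) -> str:
    """Place config.json into hf_export_path, handling both local
    checkpoint directories and remote HuggingFace model IDs."""
    os.makedirs(hf_export_path, exist_ok=True)
    dest = os.path.join(hf_export_path, "config.json")
    local_config = os.path.join(student_hf_path, "config.json")
    if os.path.isfile(local_config):
        shutil.copy(local_config, dest)  # original local-path behavior
    else:
        # Remote model ID: download just the config file from the Hub.
        from huggingface_hub import hf_hub_download
        cached = hf_hub_download(repo_id=student_hf_path, filename="config.json")
        shutil.copy(cached, dest)
    return dest
```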

In `@tests/_test_utils/torch/puzzletron/utils.py`:
- Line 89: The hardcoded trust_remote_code=True should be made caller-configurable. Add a boolean parameter trust_remote_code: bool = False to the function that loads the HF model (around lines 66-72 in tests/_test_utils/torch/puzzletron/utils.py), and pass it into AutoConfig.from_pretrained(...) and any other pretrained loaders (e.g. the call near line 152) instead of the hardcoded True. The parameter should default to False and be threaded through all transformer loading calls (AutoConfig.from_pretrained and the tokenizer/model loading sites) so callers can opt in when necessary.
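A sketch of the suggested refactor; the loader is injectable here purely so the example runs without transformers installed, whereas in the repo the default path would call AutoConfig.from_pretrained directly:

```python
def load_hf_config(model_name_or_path, trust_remote_code=False, config_loader=None):
    """Thread trust_remote_code through to the HF loader instead of
    hardcoding True; callers must opt in explicitly."""
    if config_loader is None:
        from transformers import AutoConfig  # the test utils' real dependency
        config_loader = AutoConfig.from_pretrained
    return config_loader(model_name_or_path, trust_remote_code=trust_remote_code)

# Stub loader that records what it was called with:
seen = {}
def fake_loader(name, trust_remote_code):
    seen.update(name=name, trust_remote_code=trust_remote_code)
    return "config"

load_hf_config("tiny-model", config_loader=fake_loader)
print(seen)  # → {'name': 'tiny-model', 'trust_remote_code': False}
```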

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5e8fb78d-7646-49ef-ad4c-8b49d491c83b

📥 Commits

Reviewing files that changed from the base of the PR and between 25266b8 and 6abc8ab.

📒 Files selected for processing (8)
  • examples/megatron_bridge/README.md
  • examples/megatron_bridge/distill.py
  • examples/puzzletron/README.md
  • examples/puzzletron/mbridge_distillation/README.md
  • examples/puzzletron/mbridge_distillation/distill_hf.py
  • tests/_test_utils/torch/puzzletron/utils.py
  • tests/examples/megatron_bridge/test_distill.py
  • tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py
💤 Files with no reviewable changes (3)
  • examples/puzzletron/mbridge_distillation/README.md
  • tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py
  • examples/puzzletron/mbridge_distillation/distill_hf.py

For more details, you can refer to the checkpoint conversion scripts in the [Megatron-Bridge README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).
For more details, see the [Megatron-Bridge conversion README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).

> **Known limitation:** HF export does not yet work for Puzzletron AnyModel (heterogeneous) checkpoints -- Megatron-Bridge cannot reload heterogeneous configs from saved checkpoints. Standard models export correctly with both methods.
Collaborator


Did you test this in nemo:26.02.00 or nemo:26.02.01? It was fixed in 26.02.01. Please give it another try.


> **Known limitation:** HF export does not yet work for Puzzletron AnyModel (heterogeneous) checkpoints -- Megatron-Bridge cannot reload heterogeneous configs from saved checkpoints. Standard models export correctly with both methods.

### Distillation Results
Collaborator


Can you create results/puzzletron.md file and move the results there and add a reference to it here? I plan to add minitron distillation results also so this way we can keep this doc clean.

Collaborator


Alternatively we can keep in examples/puzzletron/README.md also since thats where actual pruning is happening. Either is fine

@kevalmorabia97 kevalmorabia97 force-pushed the jrausch/distillation-consolidation branch from 4ed9996 to 8c2fa10 Compare April 10, 2026 18:54
@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner April 10, 2026 18:54
@kevalmorabia97 kevalmorabia97 requested review from kevalmorabia97 and removed request for a team April 10, 2026 18:54

codecov bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 61.40%. Comparing base (977d60a) to head (0c2a2ee).

❗ There is a different number of reports uploaded between BASE (977d60a) and HEAD (0c2a2ee). Click for more details.

HEAD has 1 upload less than BASE:

| Flag | BASE (977d60a) | HEAD (0c2a2ee) |
| --- | --- | --- |
| unit | 2 | 1 |

Additional details and impacted files

```
@@                   Coverage Diff                   @@
##           feature/puzzletron    #1220       +/-   ##
=======================================================
- Coverage               75.34%   61.40%   -13.94%
=======================================================
  Files                     466      462        -4
  Lines                   48495    48171      -324
=======================================================
- Hits                    36539    29580     -6959
- Misses                  11956    18591     +6635
```

| Flag | Coverage Δ |
| --- | --- |
| unit | 51.73% <ø> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


@kevalmorabia97
Collaborator

/ok to test 8c2fa10

@kevalmorabia97 kevalmorabia97 requested review from a team as code owners April 10, 2026 23:38
@kevalmorabia97 kevalmorabia97 requested review from realAsma and removed request for a team April 10, 2026 23:38
…tion

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the jrausch/distillation-consolidation branch from 3c0fa5a to 0c2a2ee Compare April 10, 2026 23:39
