
consolidate mbridge distillation: merge distill_hf.py into distill.py#1220

Open
j-rausch wants to merge 4 commits into feature/puzzletron from jrausch/distillation-consolidation

Conversation

Contributor

@j-rausch j-rausch commented Apr 9, 2026

Summary

  • Unified examples/puzzletron/mbridge_distillation/distill_hf.py (AnyModel-specific) into examples/megatron_bridge/distill.py (general)
    • The single script now handles both standard HF and Puzzletron AnyModel checkpoints.
  • Added --hf_export_path / --student_hf_model args for inline HF export after distillation.
  • Merged AnyModel integration test into tests/examples/megatron_bridge/test_distill.py
    • test models use vocab_size=128 (instead of the default 102) so the vocabulary divides evenly across tensor-parallel sizes up to 8.
  • Moved MMLU distillation results into megatron_bridge/README.md
    • puzzletron README now redirects to the consolidated docs.
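The vocab bump can be sanity-checked directly: a tensor-parallel embedding table is sharded across ranks, so the vocabulary size must divide evenly by the TP size, which 128 does for every size up to 8 while the old default of 102 does not.

```python
# Sanity check for the vocab_size change: the embedding table is sharded
# across tensor-parallel ranks, so vocab_size must be divisible by the TP size.
old_vocab, new_vocab = 102, 128

for tp in (1, 2, 4, 8):
    assert new_vocab % tp == 0, f"vocab_size={new_vocab} not divisible by TP={tp}"

# The old default only supports TP=1 and TP=2:
bad_tps = [tp for tp in (1, 2, 4, 8) if old_vocab % tp != 0]
print(bad_tps)  # → [4, 8]
```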

Limitation discovered during consolidation:
HF export via --hf_export_path currently does not work for Puzzletron AnyModel (heterogeneous) checkpoints. Megatron-Bridge's export_ckpt cannot reload heterogeneous model configs from saved checkpoints (heterogeneous_layers_config_encoded_json is None during __post_init__ in heterogeneous_config.py). This affects both the inline --hf_export_path flow and the separate convert_checkpoints.py export script.

The original distill_hf.py README documented this as supported, but I think it might have been broken there too (on the Megatron-Bridge side). The consolidated README now documents this as a known limitation. HF export for standard models works fine via both methods.

Summary by CodeRabbit

  • New Features

    • Optional export of distilled checkpoints to HuggingFace format at distillation completion; distillation now accepts both standard HuggingFace and Puzzletron AnyModel checkpoints (HF export unsupported for heterogeneous AnyModel checkpoints).
  • Documentation

    • Consolidated distillation guide with two export workflows, clarified output locations, MMLU comparison tables (including a regression) and recommendations; removed the older Puzzletron-specific guide.
  • Tests

    • Added integration test for Puzzletron AnyModel distillation; removed HF-only distillation test and the HF-only distillation script.
  • Chores

    • Example test matrix updated to run only Megatron Bridge examples; test tokenizer fixtures extended with additional special tokens.

…still.py

Signed-off-by: jrausch <jrausch@nvidia.com>
Signed-off-by: root <root@pool0-00848.cm.cluster>
@j-rausch j-rausch requested review from a team as code owners April 9, 2026 12:29
@j-rausch j-rausch requested review from ChenhanYu and jenchen13 and removed request for a team April 9, 2026 12:29

copy-pr-bot bot commented Apr 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

coderabbitai bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 84089e08-12d5-41e3-ba6b-01ba0aa24b2c

📥 Commits

Reviewing files that changed from the base of the PR and between 3c0fa5a and 0c2a2ee.

📒 Files selected for processing (1)
  • tests/_test_utils/torch/puzzletron/utils.py
✅ Files skipped from review due to trivial changes (1)
  • tests/_test_utils/torch/puzzletron/utils.py

📝 Walkthrough

Walkthrough

Consolidates Megatron-Bridge distillation into examples/megatron_bridge/distill.py, adds optional HuggingFace export flags and guarded Puzzletron export import, removes the standalone Puzzletron mbridge distillation script and its test, updates docs/tests for Puzzletron AnyModel support, and expands test tokenizer vocab.

Changes

  • **Docs — megatron_bridge** (`examples/megatron_bridge/README.md`): Revamped distillation docs: state that distill.py accepts HF and Puzzletron AnyModel checkpoints; document the inline HF export flags (--hf_export_path, --student_hf_model) and the alternative convert_checkpoints.py export workflow; note the HF export limitation for heterogeneous AnyModel checkpoints; add MMLU results and recommendations.
  • **Distillation script** (`examples/megatron_bridge/distill.py`): Added CLI args --hf_export_path and --student_hf_model (with validation when exporting); added a guarded import of the Puzzletron export module; extended the end-of-distillation flow to call dist.cleanup() across ranks and, on rank 0 when exporting, call AutoBridge.export_ckpt and copy the student config.json to the HF export dir.
  • **Removed Puzzletron standalone** (`examples/puzzletron/mbridge_distillation/README.md`, `examples/puzzletron/mbridge_distillation/distill_hf.py`): Deleted the standalone Puzzletron mbridge distillation README and distill_hf.py (functionality consolidated into megatron_bridge).
  • **Docs — puzzletron** (`examples/puzzletron/README.md`): Updated the link to the megatron_bridge distillation docs and noted support for HF and Puzzletron AnyModel checkpoints.
  • **Tests — megatron_bridge** (`tests/examples/megatron_bridge/test_distill.py`): Added helpers/imports to create tiny HF models and convert them to AnyModel; added test_distill_puzzletron_anymodel; adjusted the existing test to use pp_size=1; asserted expected checkpoint artifacts.
  • **Removed tests — puzzletron** (`tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py`): Removed the integration test for the deleted standalone distillation script; coverage moved to the megatron_bridge tests.
  • **Test utilities — formatting** (`tests/_test_utils/torch/puzzletron/utils.py`): Docstring reformatting for create_and_save_small_hf_model only (no behavior change).
  • **Test tokenizer** (`tests/_test_utils/torch/tokenizer/tokenizer.json`): Expanded BPE vocab with additional special tokens `<
  • **CI matrix** (`.github/workflows/example_tests.yml`): Removed puzzletron from the nemo-pr example test matrix so PR runs cover only megatron_bridge.
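The "guarded import" mentioned for distill.py keeps the script usable for standard HF checkpoints even without the Puzzletron extras installed. A minimal sketch of the pattern (the module path and function names here are illustrative, not the repo's actual imports):

```python
# Guarded optional import: the script should still run when the optional
# Puzzletron export module is unavailable, and only fail if it is requested.
try:
    from puzzletron.export import export_anymodel  # illustrative module path
    HAS_PUZZLETRON_EXPORT = True
except ImportError:
    export_anymodel = None
    HAS_PUZZLETRON_EXPORT = False

def require_puzzletron_export():
    """Fail loudly only when the user actually asks for the optional feature."""
    if not HAS_PUZZLETRON_EXPORT:
        raise RuntimeError("Puzzletron export requested but the module is not installed")
    return export_anymodel
```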

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Launcher as Torchrun Launcher
    participant Worker as Distillation Worker (per-rank)
    participant Dist as Distributed Backend
    participant Exporter as AutoBridge Export
    participant FS as Filesystem

    Launcher->>Worker: start per-rank distill.py
    Worker->>Dist: dist.setup()
    Worker->>Worker: run distillation training loop
    Worker->>Dist: request checkpoint save (per-iteration)
    Worker->>Dist: dist.cleanup() (all ranks)
    Note right of Worker: If --hf_export_path set
    Worker->>Worker: is_rank_0? (rank check)
    alt rank_0
        Worker->>Exporter: AutoBridge.export_ckpt(megatron_checkpoint, hf_path, model_name)
        Exporter->>FS: write HF model files
        Worker->>FS: copy student config.json -> hf_path
    end
```
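The end-of-run control flow in the diagram can be sketched framework-free; dist.cleanup() and AutoBridge.export_ckpt are stood in for by injected callables, so this illustrates the control flow only, not the script's actual code:

```python
import os

def finalize_distillation(rank, hf_export_path, cleanup_fn, export_fn):
    """All ranks tear down the distributed backend; only rank 0 performs
    the optional HF export (export_fn stands in for AutoBridge.export_ckpt)."""
    cleanup_fn()  # dist.cleanup() runs on every rank, export or not
    if hf_export_path is not None and rank == 0:
        os.makedirs(hf_export_path, exist_ok=True)
        export_fn(hf_export_path)  # write HF model files, copy config.json

# Illustrative dry run with stubs recording the call order:
calls = []
finalize_distillation(
    rank=0,
    hf_export_path="/tmp/hf_export_demo",
    cleanup_fn=lambda: calls.append("cleanup"),
    export_fn=lambda p: calls.append(("export", p)),
)
print(calls)  # → ['cleanup', ('export', '/tmp/hf_export_demo')]
```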

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped -- CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly and concisely summarizes the main objective of the pull request: consolidating the AnyModel-specific distill_hf.py script into the general distill.py script. |
| Security Anti-Patterns | ✅ Passed | Pull request introduces no new security anti-patterns. distill.py correctly implements trust_remote_code as a configurable CLI parameter. No unsafe torch.load, numpy.load, hardcoded trust_remote_code, eval/exec, nosec comments, or non-permissive dependencies introduced. |


@kevalmorabia97 kevalmorabia97 requested review from AAnoosheh, danielkorzekwa and kevalmorabia97 and removed request for ChenhanYu and jenchen13 April 9, 2026 12:30
Contributor

github-actions bot commented Apr 9, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1220/

Built to branch gh-pages at 2026-04-10 23:44 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/megatron_bridge/distill.py`:
- Line 304: The shutil.copy call that copies config.json from args.student_hf_path will fail when args.student_hf_path is a remote HuggingFace model ID. Update the logic around the shutil.copy usage in distill.py to detect whether args.student_hf_path is a local path; if it is a remote model ID, fetch the config via the HuggingFace API (e.g. hf_hub_download, or transformers.AutoConfig.from_pretrained followed by save_pretrained) and write it to args.hf_export_path/config.json; otherwise keep the local shutil.copy behavior. This way the code handles both local paths and remote model IDs.
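One way to implement the suggested fix, assuming only that the student path is either a local checkpoint directory or a Hub repo ID (the function name and the hf_hub_download fallback are a sketch, not the PR's code):

```python
import os
import shutil

def copy_student_config(student_hf_path: str, hf_export_path: str) -> str:
    """Place config.json into hf_export_path, handling both local
    checkpoint directories and remote HuggingFace model IDs."""
    os.makedirs(hf_export_path, exist_ok=True)
    dest = os.path.join(hf_export_path, "config.json")
    local_config = os.path.join(student_hf_path, "config.json")
    if os.path.isfile(local_config):
        shutil.copy(local_config, dest)  # original local-path behavior
    else:
        # Remote model ID: download just the config file from the Hub.
        from huggingface_hub import hf_hub_download
        cached = hf_hub_download(repo_id=student_hf_path, filename="config.json")
        shutil.copy(cached, dest)
    return dest
```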

In `@tests/_test_utils/torch/puzzletron/utils.py`:
- Line 89: The hardcoded trust_remote_code=True should be made caller-configurable. Add a boolean parameter trust_remote_code: bool = False to the function that loads the HF model (around lines 66-72 in tests/_test_utils/torch/puzzletron/utils.py), and pass it into AutoConfig.from_pretrained(...) and any other pretrained loaders (e.g. the call near line 152) instead of the hardcoded True. The parameter should default to False and be threaded through all transformer loading calls (AutoConfig.from_pretrained and the tokenizer/model loading sites) so callers can opt in when necessary.
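A sketch of the suggested refactor; the loader is injectable here purely so the example runs without transformers installed, whereas in the repo the default path would call AutoConfig.from_pretrained directly:

```python
def load_hf_config(model_name_or_path, trust_remote_code=False, config_loader=None):
    """Thread trust_remote_code through to the HF loader instead of
    hardcoding True; callers must opt in explicitly."""
    if config_loader is None:
        from transformers import AutoConfig  # the test utils' real dependency
        config_loader = AutoConfig.from_pretrained
    return config_loader(model_name_or_path, trust_remote_code=trust_remote_code)

# Stub loader that records what it was called with:
seen = {}
def fake_loader(name, trust_remote_code):
    seen.update(name=name, trust_remote_code=trust_remote_code)
    return "config"

load_hf_config("tiny-model", config_loader=fake_loader)
print(seen)  # → {'name': 'tiny-model', 'trust_remote_code': False}
```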

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5e8fb78d-7646-49ef-ad4c-8b49d491c83b

📥 Commits

Reviewing files that changed from the base of the PR and between 25266b8 and 6abc8ab.

📒 Files selected for processing (8)
  • examples/megatron_bridge/README.md
  • examples/megatron_bridge/distill.py
  • examples/puzzletron/README.md
  • examples/puzzletron/mbridge_distillation/README.md
  • examples/puzzletron/mbridge_distillation/distill_hf.py
  • tests/_test_utils/torch/puzzletron/utils.py
  • tests/examples/megatron_bridge/test_distill.py
  • tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py
💤 Files with no reviewable changes (3)
  • examples/puzzletron/mbridge_distillation/README.md
  • tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py
  • examples/puzzletron/mbridge_distillation/distill_hf.py

For more details, you can refer to the checkpoint conversion scripts in the [Megatron-Bridge README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).
For more details, see the [Megatron-Bridge conversion README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).

> **Known limitation:** HF export does not yet work for Puzzletron AnyModel (heterogeneous) checkpoints -- Megatron-Bridge cannot reload heterogeneous configs from saved checkpoints. Standard models export correctly with both methods.
Collaborator


Did you test this in nemo:26.02.00 or nemo:26.02.01? It was fixed in 26.02.01. Please give it another try.


> **Known limitation:** HF export does not yet work for Puzzletron AnyModel (heterogeneous) checkpoints -- Megatron-Bridge cannot reload heterogeneous configs from saved checkpoints. Standard models export correctly with both methods.

### Distillation Results
Collaborator


Can you create results/puzzletron.md file and move the results there and add a reference to it here? I plan to add minitron distillation results also so this way we can keep this doc clean.

Collaborator


Alternatively we can keep in examples/puzzletron/README.md also since thats where actual pruning is happening. Either is fine

@kevalmorabia97 kevalmorabia97 force-pushed the jrausch/distillation-consolidation branch from 4ed9996 to 8c2fa10 Compare April 10, 2026 18:54
@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner April 10, 2026 18:54
@kevalmorabia97 kevalmorabia97 requested review from kevalmorabia97 and removed request for a team April 10, 2026 18:54

codecov bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 61.40%. Comparing base (977d60a) to head (0c2a2ee).

❗ There is a different number of reports uploaded between BASE (977d60a) and HEAD (0c2a2ee). Click for more details.

HEAD has 1 upload less than BASE:

| Flag | BASE (977d60a) | HEAD (0c2a2ee) |
| --- | --- | --- |
| unit | 2 | 1 |

Additional details and impacted files

```
@@                   Coverage Diff                   @@
##           feature/puzzletron    #1220       +/-   ##
=======================================================
- Coverage               75.34%   61.40%   -13.94%
=======================================================
  Files                     466      462        -4
  Lines                   48495    48171      -324
=======================================================
- Hits                    36539    29580     -6959
- Misses                  11956    18591     +6635
```

| Flag | Coverage Δ |
| --- | --- |
| unit | 51.73% <ø> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


@kevalmorabia97
Collaborator

/ok to test 8c2fa10

@kevalmorabia97 kevalmorabia97 requested review from a team as code owners April 10, 2026 23:38
@kevalmorabia97 kevalmorabia97 requested review from realAsma and removed request for a team April 10, 2026 23:38
…tion

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the jrausch/distillation-consolidation branch from 3c0fa5a to 0c2a2ee Compare April 10, 2026 23:39
