Releases · NVIDIA/Model-Optimizer
ModelOpt 0.43.0 Release
Bug Fixes
- Upgraded the ONNX Runtime dependency to 1.24 to fix missing graph outputs when using the TensorRT Execution Provider.
Backward Breaking Changes
- Default `--kv_cache_qformat` in `hf_ptq.py` changed from `fp8` to `fp8_cast`. Existing scripts that rely on the default will now skip KV cache calibration and use a constant amax instead. To restore the previous calibrated behavior, explicitly pass `--kv_cache_qformat fp8`.
- Removed KV cache scale clamping (`clamp_(min=1.0)`) in the HF checkpoint export path. Calibrated KV cache scales below 1.0 are now exported as-is. If you observe accuracy degradation with calibrated KV cache (`--kv_cache_qformat fp8` or `nvfp4`), consider using the casting methods (`fp8_cast` or `nvfp4_cast`) instead.
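The practical difference between the calibrated `fp8` mode and the new `fp8_cast` default can be pictured in plain Python. This is a conceptual sketch, not the ModelOpt implementation; the calibrated amax value below is made up for illustration:

```python
# Conceptual sketch of "fp8" (calibrated amax) vs. "fp8_cast" (constant amax).
# Not the ModelOpt implementation; the calibrated amax here is a made-up value.
FP8_E4M3_MAX = 448.0  # representable max of FP8 E4M3, per the release notes

def fp8_scale(amax: float) -> float:
    # The quantization scale maps the observed amax onto the E4M3 max.
    return amax / FP8_E4M3_MAX

calibrated_scale = fp8_scale(12.5)    # "fp8": amax observed on calibration data
cast_scale = fp8_scale(FP8_E4M3_MAX)  # "fp8_cast": constant amax, so scale == 1.0

print(calibrated_scale, cast_scale)
```

With the constant amax the scale is exactly 1.0, i.e. values are simply cast to FP8 rather than rescaled from data-driven statistics, which is why scripts relying on the old default now skip calibration.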
New Features
- Add `fp8_cast` and `nvfp4_cast` modes for `--kv_cache_qformat` in `hf_ptq.py`. These use a constant amax (the FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization. A new `use_constant_amax` field in `QuantizerAttributeConfig` controls this behavior.
- MoE modules no longer need to be registered manually to ensure expert calibration coverage in the PTQ workflow.
- `hf_ptq.py` now saves the quantization summary and the MoE expert token count table to the export directory.
- Add a `--moe_calib_experts_ratio` flag in `hf_ptq.py` to specify the ratio of experts to calibrate during the forward pass, improving expert coverage during calibration. Defaults to None (not enabled).
- Add sparse attention optimization for transformer models (`modelopt.torch.sparsity.attention_sparsity`). This reduces computational cost by skipping attention computation, and supports calibration for threshold selection on HuggingFace models. See examples/llm_sparsity/attention_sparsity/README.md for usage.
- Add support for rotating the input before quantization for RHT.
- Add support for advanced weight scale search for NVFP4 quantization and its export path.
- Enable the PTQ workflow for Qwen3.5 MoE models.
- Enable the PTQ workflow for the Kimi-K2.5 model.
- Add an `nvfp4_omlp_only` quantization format for NVFP4 quantization. This is similar to `nvfp4_mlp_only` but also quantizes the output projection layer in attention.
- Add an `nvfp4_experts_only` quantization config that targets only MoE routed expert layers (excluding shared experts) with NVFP4 quantization.
- `pass_through_bwd` in the quantization config now defaults to True. Set it to False if you want to use STE with zeroed outlier gradients for potentially better QAT accuracy.
- Add a `compute_quantization_mse` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
- Autotune: a new tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. It uses TensorRT latency measurements to choose insertion schemes that minimize inference time: regions are discovered automatically, grouped by structural pattern, and multiple Q/DQ schemes are tested per pattern. Supports INT8 and FP8 quantization, a pattern cache for warm-starting on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: `python -m modelopt.onnx.quantization.autotune`. See the Autotune guide in the documentation.
- Add a `get_auto_quantize_config` API to extract a flat quantization config from `auto_quantize` search results, enabling re-quantization at different effective bit targets without re-running calibration.
- Improve `auto_quantize` checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization, plus NemotronH MoE expert support in `auto_quantize` grouping and scoring rules.
- Add support for block-granular RHT for non-power-of-2 dimensions.
- Replace ModelOpt FP8 QDQ nodes with native ONNX QDQ nodes.
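The quantity that `compute_quantization_mse` reports can be pictured with a toy fake-quantizer. This is a self-contained sketch, not the actual ModelOpt API; `fake_quantize`, `quantization_mse`, and their parameters are illustrative:

```python
# Illustrative sketch (not the compute_quantization_mse API): measuring the
# mean-squared error a quantizer introduces by comparing inputs against
# their quantize-dequantize round trip.
def fake_quantize(x: float, amax: float, num_bits: int = 8) -> float:
    # Symmetric integer fake-quantization: scale, round, clamp, rescale.
    qmax = 2 ** (num_bits - 1) - 1
    scale = amax / qmax
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale

def quantization_mse(values, amax):
    # Mean of squared per-element round-trip errors.
    errs = [(v - fake_quantize(v, amax)) ** 2 for v in values]
    return sum(errs) / len(errs)

print(quantization_mse([0.1, -0.5, 2.0, 1.25], amax=2.0))
```

The real API adds wildcard and callable filtering to select which quantizers are measured, but the per-quantizer metric is this kind of round-trip MSE.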
Deprecations
- Removed MT-Bench (FastChat) support from `examples/llm_eval`. The `run_fastchat.sh` and `gen_model_answer.py` scripts have been deleted, and the `mtbench` task has been removed from the `llm_ptq` example scripts.
- Removed deprecated NeMo-2.0 Framework references.
Misc
- Migrated project metadata from `setup.py` to a fully declarative `pyproject.toml`.
- Enabled experimental Python 3.13 wheel support and unit tests in CI/CD.
0.43.0rc4
Install the 0.43.0rc4 pre-release version using:

```shell
pip install nvidia-modelopt==0.43.0rc4 --extra-index-url https://pypi.nvidia.com
```
0.43.0rc3
Install the 0.43.0rc3 pre-release version using:

```shell
pip install nvidia-modelopt==0.43.0rc3 --extra-index-url https://pypi.nvidia.com
```
0.43.0rc2
Install the 0.43.0rc2 pre-release version using:

```shell
pip install nvidia-modelopt==0.43.0rc2 --extra-index-url https://pypi.nvidia.com
```
0.43.0rc1
Install the 0.43.0rc1 pre-release version using:

```shell
pip install nvidia-modelopt==0.43.0rc1 --extra-index-url https://pypi.nvidia.com
```
0.43.0rc0
Install the 0.43.0rc0 pre-release version using:

```shell
pip install nvidia-modelopt[all]==0.43.0rc0 --extra-index-url https://pypi.nvidia.com
```
ModelOpt 0.42.0 Release
Bug Fixes
- Fix calibration data generation with multiple samples in the ONNX workflow.
New Features
- Added a standalone type inference option (`--use_standalone_type_inference`) to ONNX AutoCast as an experimental alternative to ONNX's `infer_shapes`. This option performs type-only inference without shape inference, which can help when shape inference fails or when you want to avoid the extra shape inference overhead.
- Added quantization support for the Kimi K2 Thinking model from the original int4 checkpoint.
- Introduced support for params-constraint-based automatic neural architecture search in Minitron pruning (`mcore_minitron`) as an alternative to manual pruning with `export_config`. See examples/pruning/README.md for more details.
- Added an example for Minitron pruning using the Megatron-Bridge framework, including advanced pruning usage with params-constraint-based pruning and a new distillation example. See examples/megatron_bridge/README.md.
- Supported calibration data with multiple samples in `.npz` format in the ONNX AutoCast workflow.
- Added an `--opset` option to the ONNX quantization CLI to specify the target opset version for the quantized model.
- Enabled support for context parallelism in Eagle speculative decoding for both HuggingFace and Megatron Core models.
- Added unified Hugging Face export support for diffusers pipelines/components.
- Added support for LTX-2 and Wan2.2 (T2V) in the diffusers quantization workflow.
- Provided PTQ support for GLM-4.7, including loading MTP layer weights from a separate `mtp.safetensors` file and supporting export as-is.
- Added support for image-text data calibration in PTQ for Nemotron VL models.
- Enabled advanced weight scale search for NVFP4 quantization and its export pathway.
- Provided PTQ support for Nemotron Parse.
- Added distillation support for LTX-2. See examples/diffusers/distillation/README.md for more details.
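Packing several calibration samples into a single `.npz` archive, as the ONNX AutoCast bullet above describes, looks roughly like this. The array key `input` and the sample shape are illustrative assumptions, not required names; an in-memory buffer stands in for a file on disk:

```python
import io
import numpy as np

# Illustrative sketch: multiple calibration samples stored in one .npz
# archive. The key "input" and the shape are assumptions, not required names.
samples = np.random.rand(4, 3, 8, 8).astype(np.float32)  # 4 samples, NCHW

buf = io.BytesIO()            # stand-in for a calib_data.npz file on disk
np.savez(buf, input=samples)  # np.savez("calib_data.npz", input=samples)
buf.seek(0)

loaded = np.load(buf)
print(loaded["input"].shape)  # all 4 samples round-trip in one archive
```

A workflow that previously accepted only a single sample per file can now consume the whole leading batch dimension from one archive.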
0.42.0rc2
Install the 0.42.0rc2 pre-release version using:

```shell
pip install nvidia-modelopt[all]==0.42.0rc2 --extra-index-url https://pypi.nvidia.com
```
0.42.0rc1
Install the 0.42.0rc1 pre-release version using:

```shell
pip install nvidia-modelopt==0.42.0rc1 --extra-index-url https://pypi.nvidia.com
```
0.42.0rc0
Install the 0.42.0rc0 pre-release version using:

```shell
pip install nvidia-modelopt==0.42.0rc0 --extra-index-url https://pypi.nvidia.com
```