Releases: NVIDIA/Model-Optimizer

ModelOpt 0.43.0 Release

16 Apr 19:22
ccabb95


Bug Fixes

  • Upgraded the ONNX Runtime dependency to 1.24 to fix missing graph outputs when using the TensorRT Execution Provider.

Backward Breaking Changes

  • The default --kv_cache_qformat in hf_ptq.py changed from fp8 to fp8_cast. Scripts that rely on the default will now skip KV cache calibration and use a constant amax instead. To restore the previous calibrated behavior, explicitly pass --kv_cache_qformat fp8.
  • Removed KV cache scale clamping (clamp_(min=1.0)) in the HF checkpoint export path. Calibrated KV cache scales below 1.0 are now exported as-is. If you observe accuracy degradation with calibrated KV cache (--kv_cache_qformat fp8 or nvfp4), consider using the casting methods (fp8_cast or nvfp4_cast) instead.
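The effect of the clamp removal can be illustrated with a small sketch. This is not ModelOpt code; it only mirrors the behavior described above, where the old export path applied clamp_(min=1.0) to calibrated KV cache scales and the new path exports them as-is.

```python
def exported_kv_scale(calibrated_scale: float, legacy_clamp: bool = False) -> float:
    """Return the KV cache scale as it would appear in the exported checkpoint.

    legacy_clamp=True emulates the old export path, which clamped the
    calibrated scale to a minimum of 1.0; the new path keeps it as-is.
    """
    return max(calibrated_scale, 1.0) if legacy_clamp else calibrated_scale

# A calibrated scale below 1.0 used to be rounded up to 1.0 on export:
assert exported_kv_scale(0.35, legacy_clamp=True) == 1.0
# It is now exported unchanged:
assert exported_kv_scale(0.35) == 0.35
# Scales at or above 1.0 were never affected by the clamp:
assert exported_kv_scale(2.0, legacy_clamp=True) == 2.0
```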

New Features

  • Add fp8_cast and nvfp4_cast modes for --kv_cache_qformat in hf_ptq.py. These use a constant amax (FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization. A new use_constant_amax field in QuantizerAttributeConfig controls this behavior.
  • Users no longer need to manually register MoE modules to ensure expert calibration coverage in the PTQ workflow.
  • hf_ptq.py now saves the quantization summary and the MoE expert token count table to the export directory.
  • Add a --moe_calib_experts_ratio flag in hf_ptq.py to specify the ratio of experts to calibrate during the forward pass, improving expert coverage during calibration. Defaults to None (disabled).
  • Add sparse attention optimization for transformer models (modelopt.torch.sparsity.attention_sparsity), reducing computational cost by skipping attention computation. Calibration-based threshold selection is supported for HuggingFace models. See examples/llm_sparsity/attention_sparsity/README.md for usage.
  • Add support for rotating the input before quantization for RHT.
  • Add support for advanced weight scale search for NVFP4 quantization and its export path.
  • Enable PTQ workflow for Qwen3.5 MoE models.
  • Enable PTQ workflow for the Kimi-K2.5 model.
  • Add nvfp4_omlp_only quantization format for NVFP4 quantization. This is similar to nvfp4_mlp_only but also quantizes the output projection layer in attention.
  • Add nvfp4_experts_only quantization config that targets only MoE routed expert layers (excluding shared) with NVFP4 quantization.
  • pass_through_bwd in the quantization config now defaults to True. Set it to False to use STE with zeroed outlier gradients, which may yield better QAT accuracy.
  • Add compute_quantization_mse API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
  • Autotune: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: python -m modelopt.onnx.quantization.autotune. See the Autotune guide in the documentation.
  • Add get_auto_quantize_config API to extract a flat quantization config from auto_quantize search results, enabling re-quantization at different effective bit targets without re-running calibration.
  • Improve auto_quantize checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
  • Add support for Nemotron-3 (NemotronHForCausalLM) model quantization, including NemotronH MoE expert handling in auto_quantize grouping and scoring rules.
  • Add support for block-granular RHT for non-power-of-2 dimensions.
  • Replace modelopt FP8 QDQ nodes with native ONNX QDQ nodes.
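The distinction between the calibrated (fp8, nvfp4) and cast (fp8_cast, nvfp4_cast) KV cache modes comes down to how amax is chosen. The sketch below is an illustration of that difference only, not ModelOpt's implementation: the cast modes use the constant FP8 E4M3 maximum (448.0) noted above, while the calibrated modes derive amax from observed activation magnitudes.

```python
FP8_E4M3_MAX = 448.0  # constant amax used by the cast modes, per the release notes

def kv_cache_amax(activations: list[float], use_constant_amax: bool) -> float:
    """Pick the amax for KV cache quantization.

    use_constant_amax=True mimics fp8_cast / nvfp4_cast (no data-driven
    calibration); False mimics the calibrated fp8 / nvfp4 modes, which
    take the largest observed absolute activation value.
    """
    if use_constant_amax:
        return FP8_E4M3_MAX
    return max(abs(v) for v in activations)

acts = [0.5, -3.2, 2.7]
assert kv_cache_amax(acts, use_constant_amax=True) == 448.0
assert kv_cache_amax(acts, use_constant_amax=False) == 3.2
```

The constant-amax path trades per-model tightness of the quantization range for a calibration-free workflow, which the notes above recommend trying when calibrated KV cache scales cause accuracy degradation.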

Deprecations

  • Removed MT-Bench (FastChat) support from examples/llm_eval. The run_fastchat.sh and gen_model_answer.py scripts have been deleted, and the mtbench task has been removed from the llm_ptq example scripts.
  • Remove deprecated NeMo-2.0 Framework references.

Misc

  • Migrated project metadata from setup.py to a fully declarative pyproject.toml.
  • Enable experimental Python 3.13 wheel support and unit tests in CI/CD.

0.43.0rc4

13 Apr 18:25
ccabb95


Pre-release

Install the 0.43.0rc4 pre-release version using

pip install nvidia-modelopt==0.43.0rc4 --extra-index-url https://pypi.nvidia.com

0.43.0rc3

06 Apr 15:35
f3151d2


Pre-release

Install the 0.43.0rc3 pre-release version using

pip install nvidia-modelopt==0.43.0rc3 --extra-index-url https://pypi.nvidia.com

0.43.0rc2

29 Mar 12:30
0315fb1


Pre-release

Install the 0.43.0rc2 pre-release version using

pip install nvidia-modelopt==0.43.0rc2 --extra-index-url https://pypi.nvidia.com

0.43.0rc1

17 Mar 06:16
00fa5bd


Pre-release

Install the 0.43.0rc1 pre-release version using

pip install nvidia-modelopt==0.43.0rc1 --extra-index-url https://pypi.nvidia.com

0.43.0rc0

17 Mar 05:45
e4df91b


Pre-release

Install the 0.43.0rc0 pre-release version using

pip install nvidia-modelopt[all]==0.43.0rc0 --extra-index-url https://pypi.nvidia.com

ModelOpt 0.42.0 Release

09 Mar 20:31
e2a4a8b


Bug Fixes

  • Fix calibration data generation with multiple samples in the ONNX workflow.

New Features

  • Added a standalone type inference option (--use_standalone_type_inference) to ONNX AutoCast as an experimental alternative to ONNX's infer_shapes. This option performs type-only inference without shape inference, which can help when shape inference fails or when you want to avoid extra shape inference overhead.
  • Added quantization support for the Kimi K2 Thinking model from the original int4 checkpoint.
  • Introduced support for params constraint-based automatic neural architecture search in Minitron pruning (mcore_minitron) as an alternative to manual pruning with export_config. See examples/pruning/README.md for more details.
  • Added an example for Minitron pruning using the Megatron-Bridge framework, including advanced usage with params-constraint-based pruning and a new distillation example. See examples/megatron_bridge/README.md.
  • Supported calibration data with multiple samples in .npz format in the ONNX Autocast workflow.
  • Added the --opset option to the ONNX quantization CLI to specify the target opset version for the quantized model.
  • Enabled support for context parallelism in Eagle speculative decoding for both HuggingFace and Megatron Core models.
  • Added unified Hugging Face export support for diffusers pipelines/components.
  • Added support for LTX-2 and Wan2.2 (T2V) in the diffusers quantization workflow.
  • Provided PTQ support for GLM-4.7, including loading MTP layer weights from a separate mtp.safetensors file and exporting them as-is.
  • Added support for image-text data calibration in PTQ for Nemotron VL models.
  • Enabled advanced weight scale search for NVFP4 quantization and its export pathway.
  • Provided PTQ support for Nemotron Parse.
  • Added distillation support for LTX-2. See examples/diffusers/distillation/README.md for more details.
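The standalone type inference option mentioned above performs type-only propagation without shape inference. The toy sketch below illustrates that idea under stated assumptions; it is not ModelOpt's implementation, and the node format and the hypothetical "Cast_fp16" op are invented for illustration. The point is that element types can be propagated through a graph even when shapes are unknown or shape inference fails.

```python
def propagate_types(nodes, input_types):
    """Propagate element types through a toy ONNX-like graph, ignoring shapes.

    `nodes` is a list of (op, input_names, output_name) tuples; most ops
    simply keep the dtype of their first input, while a cast-style op
    overrides it. No shape information is needed or computed.
    """
    types = dict(input_types)
    for op, inputs, output in nodes:
        if op == "Cast_fp16":           # hypothetical op that forces float16
            types[output] = "float16"
        else:                           # default: inherit first input's dtype
            types[output] = types[inputs[0]]
    return types

graph = [("MatMul", ["x", "w"], "y"), ("Cast_fp16", ["y"], "z")]
out = propagate_types(graph, {"x": "float32", "w": "float32"})
assert out["y"] == "float32" and out["z"] == "float16"
```

Because nothing here depends on tensor shapes, this style of pass avoids the overhead (and failure modes) of full shape inference, which is the trade-off the --use_standalone_type_inference option targets.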

0.42.0rc2

28 Feb 18:32
eaf5d7e


Pre-release

Install the 0.42.0rc2 pre-release version using

pip install nvidia-modelopt[all]==0.42.0rc2 --extra-index-url https://pypi.nvidia.com

0.42.0rc1

21 Feb 14:50
f08a65f


Pre-release

Install the 0.42.0rc1 pre-release version using

pip install nvidia-modelopt==0.42.0rc1 --extra-index-url https://pypi.nvidia.com

0.42.0rc0

04 Feb 05:34
87237e7


Pre-release

Install the 0.42.0rc0 pre-release version using

pip install nvidia-modelopt==0.42.0rc0 --extra-index-url https://pypi.nvidia.com