Releases · NVIDIA/Model-Optimizer
ModelOpt 0.43.0 Release
Bug Fixes
- Upgraded the ONNX Runtime dependency to 1.24 to fix missing graph outputs when using the TensorRT Execution Provider.
Backward Breaking Changes
- Default `--kv_cache_qformat` in `hf_ptq.py` changed from `fp8` to `fp8_cast`. Existing scripts that rely on the default will now skip KV cache calibration and use a constant amax instead. To restore the previous calibrated behavior, explicitly pass `--kv_cache_qformat fp8`.
- Removed KV cache scale clamping (`clamp_(min=1.0)`) in the HF checkpoint export path. Calibrated KV cache scales below 1.0 are now exported as-is. If you observe accuracy degradation with calibrated KV cache (`--kv_cache_qformat fp8` or `nvfp4`), consider using the casting methods (`fp8_cast` or `nvfp4_cast`) instead.
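The practical difference between the calibrated `fp8` mode and the new `fp8_cast` default can be pictured in plain Python. This is a conceptual sketch, not the ModelOpt implementation; the calibrated amax value below is made up for illustration:

```python
# Conceptual sketch of "fp8" (calibrated amax) vs. "fp8_cast" (constant amax).
# Not the ModelOpt implementation; the calibrated amax here is a made-up value.
FP8_E4M3_MAX = 448.0  # representable max of FP8 E4M3, per the release notes

def fp8_scale(amax: float) -> float:
    # The quantization scale maps the observed amax onto the E4M3 max.
    return amax / FP8_E4M3_MAX

calibrated_scale = fp8_scale(12.5)    # "fp8": amax observed on calibration data
cast_scale = fp8_scale(FP8_E4M3_MAX)  # "fp8_cast": constant amax, so scale == 1.0

print(calibrated_scale, cast_scale)
```

With the constant amax the scale is exactly 1.0, i.e. values are simply cast to FP8 rather than rescaled from data-driven statistics, which is why scripts relying on the old default now skip calibration.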
New Features
- Add `fp8_cast` and `nvfp4_cast` modes for `--kv_cache_qformat` in `hf_ptq.py`. These use a constant amax (the FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization. A new `use_constant_amax` field in `QuantizerAttributeConfig` controls this behavior.
- MoE modules no longer need to be registered manually to ensure expert calibration coverage in the PTQ workflow.
- `hf_ptq.py` now saves the quantization summary and the MoE expert token count table to the export directory.
- Add a `--moe_calib_experts_ratio` flag in `hf_ptq.py` to specify the ratio of experts to calibrate during the forward pass, improving expert coverage during calibration. Defaults to None (not enabled).
- Add sparse attention optimization for transformer models (`modelopt.torch.sparsity.attention_sparsity`). This reduces computational cost by skipping attention computation, and supports calibration for threshold selection on HuggingFace models. See examples/llm_sparsity/attention_sparsity/README.md for usage.
- Add support for rotating the input before quantization for RHT.
- Add support for advanced weight scale search for NVFP4 quantization and its export path.
- Enable the PTQ workflow for Qwen3.5 MoE models.
- Enable the PTQ workflow for the Kimi-K2.5 model.
- Add an `nvfp4_omlp_only` quantization format for NVFP4 quantization. This is similar to `nvfp4_mlp_only` but also quantizes the output projection layer in attention.
- Add an `nvfp4_experts_only` quantization config that targets only MoE routed expert layers (excluding shared experts) with NVFP4 quantization.
- `pass_through_bwd` in the quantization config now defaults to True. Set it to False if you want to use STE with zeroed outlier gradients for potentially better QAT accuracy.
- Add a `compute_quantization_mse` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
- Autotune: a new tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. It uses TensorRT latency measurements to choose insertion schemes that minimize inference time: regions are discovered automatically, grouped by structural pattern, and multiple Q/DQ schemes are tested per pattern. Supports INT8 and FP8 quantization, a pattern cache for warm-starting on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: `python -m modelopt.onnx.quantization.autotune`. See the Autotune guide in the documentation.
- Add a `get_auto_quantize_config` API to extract a flat quantization config from `auto_quantize` search results, enabling re-quantization at different effective bit targets without re-running calibration.
- Improve `auto_quantize` checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization, plus NemotronH MoE expert support in `auto_quantize` grouping and scoring rules.
- Add support for block-granular RHT for non-power-of-2 dimensions.
- Replace ModelOpt FP8 QDQ nodes with native ONNX QDQ nodes.
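The quantity that `compute_quantization_mse` reports can be pictured with a toy fake-quantizer. This is a self-contained sketch, not the actual ModelOpt API; `fake_quantize`, `quantization_mse`, and their parameters are illustrative:

```python
# Illustrative sketch (not the compute_quantization_mse API): measuring the
# mean-squared error a quantizer introduces by comparing inputs against
# their quantize-dequantize round trip.
def fake_quantize(x: float, amax: float, num_bits: int = 8) -> float:
    # Symmetric integer fake-quantization: scale, round, clamp, rescale.
    qmax = 2 ** (num_bits - 1) - 1
    scale = amax / qmax
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale

def quantization_mse(values, amax):
    # Mean of squared per-element round-trip errors.
    errs = [(v - fake_quantize(v, amax)) ** 2 for v in values]
    return sum(errs) / len(errs)

print(quantization_mse([0.1, -0.5, 2.0, 1.25], amax=2.0))
```

The real API adds wildcard and callable filtering to select which quantizers are measured, but the per-quantizer metric is this kind of round-trip MSE.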
Deprecations
- Removed MT-Bench (FastChat) support from `examples/llm_eval`. The `run_fastchat.sh` and `gen_model_answer.py` scripts have been deleted, and the `mtbench` task has been removed from the `llm_ptq` example scripts.
- Removed deprecated NeMo-2.0 Framework references.
Misc
- Migrated project metadata from `setup.py` to a fully declarative `pyproject.toml`.
- Enabled experimental Python 3.13 wheel support and unit tests in CI/CD.
0.43.0rc4
Install the 0.43.0rc4 pre-release version using:

```shell
pip install nvidia-modelopt==0.43.0rc4 --extra-index-url https://pypi.nvidia.com
```
0.43.0rc3
Install the 0.43.0rc3 pre-release version using:

```shell
pip install nvidia-modelopt==0.43.0rc3 --extra-index-url https://pypi.nvidia.com
```
0.43.0rc2
Install the 0.43.0rc2 pre-release version using:

```shell
pip install nvidia-modelopt==0.43.0rc2 --extra-index-url https://pypi.nvidia.com
```
0.43.0rc1
Install the 0.43.0rc1 pre-release version using:

```shell
pip install nvidia-modelopt==0.43.0rc1 --extra-index-url https://pypi.nvidia.com
```
0.43.0rc0
Install the 0.43.0rc0 pre-release version using:

```shell
pip install nvidia-modelopt[all]==0.43.0rc0 --extra-index-url https://pypi.nvidia.com
```
ModelOpt 0.42.0 Release
Bug Fixes
- Fix calibration data generation with multiple samples in the ONNX workflow.
New Features
- Added a standalone type inference option (`--use_standalone_type_inference`) to ONNX AutoCast as an experimental alternative to ONNX's `infer_shapes`. This option performs type-only inference without shape inference, which can help when shape inference fails or when you want to avoid the extra shape inference overhead.
- Added quantization support for the Kimi K2 Thinking model from the original int4 checkpoint.
- Introduced support for params-constraint-based automatic neural architecture search in Minitron pruning (`mcore_minitron`) as an alternative to manual pruning with `export_config`. See examples/pruning/README.md for more details.
- Added an example for Minitron pruning using the Megatron-Bridge framework, including advanced pruning usage with params-constraint-based pruning and a new distillation example. See examples/megatron_bridge/README.md.
- Supported calibration data with multiple samples in `.npz` format in the ONNX AutoCast workflow.
- Added an `--opset` option to the ONNX quantization CLI to specify the target opset version for the quantized model.
- Enabled support for context parallelism in Eagle speculative decoding for both HuggingFace and Megatron Core models.
- Added unified Hugging Face export support for diffusers pipelines/components.
- Added support for LTX-2 and Wan2.2 (T2V) in the diffusers quantization workflow.
- Provided PTQ support for GLM-4.7, including loading MTP layer weights from a separate `mtp.safetensors` file and supporting export as-is.
- Added support for image-text data calibration in PTQ for Nemotron VL models.
- Enabled advanced weight scale search for NVFP4 quantization and its export pathway.
- Provided PTQ support for Nemotron Parse.
- Added distillation support for LTX-2. See examples/diffusers/distillation/README.md for more details.
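Packing several calibration samples into a single `.npz` archive, as the ONNX AutoCast bullet above describes, looks roughly like this. The array key `input` and the sample shape are illustrative assumptions, not required names; an in-memory buffer stands in for a file on disk:

```python
import io
import numpy as np

# Illustrative sketch: multiple calibration samples stored in one .npz
# archive. The key "input" and the shape are assumptions, not required names.
samples = np.random.rand(4, 3, 8, 8).astype(np.float32)  # 4 samples, NCHW

buf = io.BytesIO()            # stand-in for a calib_data.npz file on disk
np.savez(buf, input=samples)  # np.savez("calib_data.npz", input=samples)
buf.seek(0)

loaded = np.load(buf)
print(loaded["input"].shape)  # all 4 samples round-trip in one archive
```

A workflow that previously accepted only a single sample per file can now consume the whole leading batch dimension from one archive.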
0.42.0rc2
Install the 0.42.0rc2 pre-release version using:

```shell
pip install nvidia-modelopt[all]==0.42.0rc2 --extra-index-url https://pypi.nvidia.com
```
0.42.0rc1
Install the 0.42.0rc1 pre-release version using:

```shell
pip install nvidia-modelopt==0.42.0rc1 --extra-index-url https://pypi.nvidia.com
```
0.42.0rc0
Install the 0.42.0rc0 pre-release version using:

```shell
pip install nvidia-modelopt==0.42.0rc0 --extra-index-url https://pypi.nvidia.com
```