@@ -9,7 +9,7 @@ Reference: [arXiv:2602.06036](https://arxiv.org/abs/2602.06036) |

## Architecture

-```
+```text
Target Model (frozen)
     │
     ├─ hidden_states[layer 1, 9, 17, 25, 33] ──► concat ──► FC + RMSNorm ──► target_hidden
@@ -43,7 +43,7 @@ Target Model (frozen)

Given context `"The answer is"` and block_size=4 with anchor `"is"`:

-```
+```text
Target model hidden states (from frozen base model):
    h["The"]   h["answer"]   h["is"]    ← target_hidden (ctx_len=3)
       │           │            │
@@ -87,7 +87,7 @@ In each DFlash decoder layer:

**Training vs Inference:**

-```
+```text
TRAINING (2 anchors, block_size=4):

  Context tokens:  "The"  "answer"  "is"  "5"  "."
@@ -166,13 +166,25 @@ See [`modelopt_recipes/general/speculative_decoding/dflash.yaml`](../../../model
| `dflash.dflash_architecture_config.mask_token_id` | auto | Token ID for masked positions |
| `training.answer_only_loss` | false | Mask loss on non-assistant tokens |

+> **Note on `answer_only_loss` and chat templates:** When `answer_only_loss=true`, the
+> dataset loader replaces the tokenizer's chat template with a simplified version that has
+> `{% generation %}` tags to identify assistant turns. This simplified template may not
+> support all features of the original (e.g., tool-use formatting, multi-turn system
+> prompts). During serving, the draft model reuses the target model's original tokenizer
+> and template, so there is no train/inference mismatch in the tokenization itself; only
+> the loss masking during training uses the simplified template. However, if training data
+> contains tool-use conversations with model-family-specific formatting, the simplified
+> template may tokenize them differently, affecting which tokens get masked. For best
+> results with tool-use data, set `answer_only_loss=false` or provide a custom
+> `chat_template` that supports both generation tags and tool-use formatting.
+
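As a hedged illustration of how `{% generation %}` tags can drive loss masking, here is a minimal sketch with a toy whitespace tokenizer. `assistant_mask` is a hypothetical helper, not the actual dataset-loader code; real tokenizers segment text differently.

```python
import re

def assistant_mask(rendered: str) -> tuple[list[str], list[int]]:
    """Toy sketch: tokens inside {% generation %} blocks keep their loss (mask=1)."""
    tokens, mask = [], []
    # Split into tagged and untagged segments; the capture group keeps the matches.
    parts = re.split(r"({% generation %}.*?{% endgeneration %})", rendered, flags=re.S)
    for seg in parts:
        inside = seg.startswith("{% generation %}")
        body = re.sub(r"{%.*?%}", "", seg)  # strip the tags themselves
        for tok in body.split():            # toy whitespace "tokenizer"
            tokens.append(tok)
            mask.append(1 if inside else 0)
    return tokens, mask

toks, mask = assistant_mask(
    "[USR] What is 2+3? {% generation %}The answer is 5{% endgeneration %}"
)
# mask is 0 for the user tokens and 1 for the assistant turn
```

This mirrors the intent of the simplified template: only tokens rendered inside the generation tags contribute to the training loss.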
### Random Anchor Sampling (`num_anchors`)

During training, anchor positions are sampled randomly from valid (assistant-response)
tokens in each batch, rather than by dividing the sequence into fixed blocks. Each anchor
starts a block of `block_size` tokens where the draft model predicts positions 1..B-1.

-```
+```text
Sequence:   [SYS]  You  helpful  [USR]  What  2+3?  [AST]  The  answer  is   5
Position:     0     1      2       3      4     5      6     7     8     9   10
loss_mask:    0     0      0       0      0     0      0     1     1     1    1
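The sampling above can be sketched in a few lines. This is an illustrative sketch assuming a plain Python `loss_mask` list; `sample_anchors` is a hypothetical helper name, not the actual trainer code.

```python
import random

def sample_anchors(loss_mask: list[int], num_anchors: int) -> list[int]:
    # Valid anchor positions are assistant-response tokens (loss_mask == 1);
    # anchors are drawn uniformly without replacement, not on a fixed grid.
    valid = [i for i, m in enumerate(loss_mask) if m == 1]
    return sorted(random.sample(valid, num_anchors))

loss_mask = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # mask from the diagram above
anchors = sample_anchors(loss_mask, num_anchors=2)  # two positions in 7..10
```

Because anchors are resampled every batch, each assistant token eventually serves as a block start, giving the draft model coverage of all response positions.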
@@ -208,7 +220,7 @@ The exponential decay factor (gamma) weights early block positions higher than l
If position 1 in a block is wrong, all subsequent positions are rejected in speculative
decoding. Decay aligns the training loss with what matters for acceptance rate.

-```
+```text
weight[k] = exp(-(k-1).clamp(min=0) / gamma)   for k = 0..B-1
```

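The formula corresponds to this small sketch (`decay_weights` is a hypothetical helper name; because of the clamp, positions 0 and 1 both get full weight):

```python
import math

def decay_weights(block_size: int, gamma: float) -> list[float]:
    # weight[k] = exp(-max(k - 1, 0) / gamma) for k = 0..B-1
    return [math.exp(-max(k - 1, 0) / gamma) for k in range(block_size)]

w = decay_weights(block_size=4, gamma=7.0)
# w[0] == w[1] == 1.0; later positions decay toward 0
```

A larger gamma flattens the weights (all positions matter nearly equally); a smaller gamma concentrates the loss on the first predicted position.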
@@ -324,8 +336,8 @@ ModelOpt wins acceptance length on 7/8 categories and TPS on 8/8 categories.
- **FP8 / NVFP4 quantization**: Export pipeline supports quantized checkpoints via
  `hf_ptq.py` (PTQ succeeded in testing). AR impact of quantization not yet measured.
  The flow: train (bf16) → `mtq.quantize(model, quant_cfg)` → `export_hf_checkpoint.py`.
-- **Checkpoint resume**: `DFlashModule._apply()` handles meta-tensor rotary buffers.
-  Validated in training runs but not covered by integration tests.
+- **Checkpoint resume**: `DFlashModule._apply()` handles meta-tensor rotary buffers
+  (one-shot check on first `.to(device)` call). Validated in train+resume E2E tests.

### Validated

@@ -334,10 +346,12 @@ ModelOpt wins acceptance length on 7/8 categories and TPS on 8/8 categories.
- **AR evaluation**: `ar_validate.py` with online ground truth (GT), per-category MT-Bench.
- **vLLM deployment**: Speculative decoding with `vllm/vllm-openai:nightly` (v0.19.1+);
  3.1x speedup over baseline. Per-category benchmarks on MT-Bench.
+
  ```bash
  vllm serve Qwen/Qwen3-8B \
    --speculative-config '{"method": "dflash", "model": "path/to/checkpoint", "num_speculative_tokens": 7}' \
    --max-num-batched-tokens 32768
  ```
+
- **Export**: z-lab-compatible HF format, loadable by vLLM and the z-lab benchmark.
- **Loss decay**: Validated a +0.12 AR improvement with gamma=7 (bs16).