
[Torch 2.10 Megatron GRPO] The AccumulateGrad node's stream does not match the stream of the node that produced the incoming gradient. #8317

@BierOne

Description


Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

Hi, training GRPO (Qwen3.5 35b-a3b) with the latest image produces the following warning:

Image: modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.8.1-py311-torch2.10.0-vllm0.17.0-modelscope1.34.0-swift4.0.1

The AccumulateGrad node's stream does not match the stream of the node that produced the incoming gradient.

Could this affect training accuracy? (It currently looks like an issue specific to torch 2.10.0 + CUDA 12.8; see the referenced error report.)

How to Reproduce

training_options=" \
    --model_type qwen3_5_moe \
    --freeze_llm false \
    --freeze_vit true \
    --freeze_aligner true \
    --add_non_thinking_prefix true \
    --loss_scale ignore_empty_think \
    --decoder_first_pipeline_num_layers 24 \
    --steps_per_generation 8 \
    --micro_batch_size 1 \
    --global_batch_size 64 \
    --num_generations 8"
    # --mtp_num_layers ${NUM_MTP_LAYER} \
    # --mtp_loss_scaling_factor 0.1 \
# VLLM options
vllm_options=" \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_tensor_parallel_size 16 \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_max_model_len 20480"
# Common training options
common_training_options=" \
    --rlhf_type grpo \
    --loss_type sapo \
    --max_length 16384 \
    --max_completion_length 4086 \
    --lr 1e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-6 \
    --train_type full \
    --reward_funcs format \
    --tau_pos 1 \
    --tau_neg 1.05 \
    --epsilon 0.2 \
    --epsilon_high 0.2 \
    --beta 0.001 \
    --finetune true \
    --packing false \
    --padding_free true \
    --dynamic_sample false \
    --num_train_epochs 2 \
    --overlong_filter false \
    --importance_sampling_level token"
    # > Note: if `overlong_filter` is enabled, the kl and clip_ratio metrics filter out overlong samples
    # --external_plugins ${CUSTOM_WORK_DIR}/scripts/base/grpo/latest_plugin.py \

# Checkpoint options
checkpoint_options=" \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --attention_backend flash \
    --save_strategy epoch \
    --save_steps 500 \
    --logging_steps 1 \
    --log_completions true \
    --dataloader_num_workers 32 \
    --output_dir ${OUTPUT_DIR}/${OUT_NAME} \
    --no_save_optim \
    --no_save_rng"
# Offload options
offload_options=" \
    --offload_bridge true \
    --sleep_level 2 \
    --offload_model true \
    --offload_optimizer true \
    --optimizer_cpu_offload true"
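For reference, the option groups above would typically be concatenated into a single launch command. A minimal sketch, assuming the ms-swift Megatron entrypoint `megatron rlhf` and an 8-GPU node (both are assumptions, not taken from this report; abbreviated stand-ins replace the full option strings defined above):

```shell
#!/bin/sh
# Hypothetical assembly of the option groups into one launch command.
# The `megatron rlhf` entrypoint and NPROC_PER_NODE=8 are assumptions.

# Abbreviated stand-ins for the full option groups defined above:
training_options="--model_type qwen3_5_moe --micro_batch_size 1"
vllm_options="--use_vllm true --vllm_mode colocate"
common_training_options="--rlhf_type grpo --loss_type sapo"
checkpoint_options="--save_strategy epoch --logging_steps 1"
offload_options="--offload_model true --offload_optimizer true"

launch_cmd="NPROC_PER_NODE=8 megatron rlhf ${training_options} ${vllm_options} ${common_training_options} ${checkpoint_options} ${offload_options}"
echo "${launch_cmd}"
```

The grouping itself is just shell string concatenation; the environment variable and entrypoint should be checked against the ms-swift documentation for the installed version.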

Additional Information

No response

Labels: bug