Checklist / 检查清单
Bug Description / Bug 描述
Hi, training GRPO (Qwen3.5 35b-a3b) with the latest image produced this warning:
Image: modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.8.1-py311-torch2.10.0-vllm0.17.0-modelscope1.34.0-swift4.0.1
The AccumulateGrad node's stream does not match the stream of the node that produced the incoming gradient.
Does this affect training accuracy? (It currently looks like an issue that appears with torch 2.10.0 + cuda 12.8.) See the referenced error report.
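For context, a minimal sketch (not the ms-swift code path, and `stream_mismatch_demo` is a hypothetical name) of the kind of pattern that can produce this warning: the parameter lives on the default CUDA stream, but the forward and backward work is launched on a side stream, so the AccumulateGrad node for the parameter and the node producing its incoming gradient can end up on different streams.

```python
import torch

def stream_mismatch_demo():
    """Sketch of a cross-stream forward/backward; assumes a CUDA device."""
    if not torch.cuda.is_available():
        return "cuda unavailable"
    # Parameter is created (and its grad will be accumulated) on the
    # default stream.
    w = torch.ones(4, device="cuda", requires_grad=True)
    side = torch.cuda.Stream()
    with torch.cuda.stream(side):
        # The gradient-producing ops run on the side stream; newer torch
        # versions may warn that AccumulateGrad's stream does not match.
        loss = (w * 2).sum()
        loss.backward()
    # Explicit synchronization makes the accumulated gradient safe to read.
    torch.cuda.synchronize()
    return w.grad

if __name__ == "__main__":
    print(stream_mismatch_demo())
```

Whether the warning indicates an actual correctness problem depends on whether the framework synchronizes the streams before the gradient is consumed; in this sketch the explicit `torch.cuda.synchronize()` makes the result well-defined.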
How to Reproduce / 如何复现
training_options=" \
--model_type qwen3_5_moe \
--freeze_llm false \
--freeze_vit true \
--freeze_aligner true \
--add_non_thinking_prefix true \
--loss_scale ignore_empty_think \
--decoder_first_pipeline_num_layers 24 \
--steps_per_generation 8 \
--micro_batch_size 1 \
--global_batch_size 64 \
--num_generations 8"
# --mtp_num_layers ${NUM_MTP_LAYER} \
# --mtp_loss_scaling_factor 0.1 \
# VLLM options
vllm_options=" \
--use_vllm true \
--vllm_mode colocate \
--vllm_tensor_parallel_size 16 \
--vllm_gpu_memory_utilization 0.5 \
--vllm_max_model_len 20480"
# Common training options
common_training_options=" \
--rlhf_type grpo \
--loss_type sapo \
--max_length 16384 \
--max_completion_length 4086 \
--lr 1e-5 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-6 \
--train_type full \
--reward_funcs format \
--tau_pos 1 \
--tau_neg 1.05 \
--epsilon 0.2 \
--epsilon_high 0.2 \
--beta 0.001 \
--finetune true \
--packing false \
--padding_free true \
--dynamic_sample false \
--num_train_epochs 2 \
--overlong_filter false \
--importance_sampling_level token"
# > Note: if `overlong_filter` is enabled, the kl and clip_ratio metrics filter out overlong samples
# --external_plugins ${CUSTOM_WORK_DIR}/scripts/base/grpo/latest_plugin.py \
# Checkpoint options
checkpoint_options=" \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--attention_backend flash \
--save_strategy epoch \
--save_steps 500 \
--logging_steps 1 \
--log_completions true \
--dataloader_num_workers 32 \
--output_dir ${OUTPUT_DIR}/${OUT_NAME} \
--no_save_optim \
--no_save_rng"
# Offload options
offload_options=" \
--offload_bridge true \
--sleep_level 2 \
--offload_model true \
--offload_optimizer true \
--optimizer_cpu_offload true"
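The report omits the line that actually launches training with these option groups. A plausible assembly, assuming the ms-swift Megatron RLHF entry point (`megatron rlhf`, consistent with the Megatron-style flags above) and that `OUTPUT_DIR`, `OUT_NAME`, and the model/dataset arguments are defined elsewhere in the script:

```shell
# Hypothetical launcher (not included in the original report): combine the
# option groups defined above into a single invocation.
megatron rlhf \
    ${training_options} \
    ${vllm_options} \
    ${common_training_options} \
    ${checkpoint_options} \
    ${offload_options}
```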
Additional Information / 补充信息
No response