graph : add clamping to ffn_moe_weights_sum to avoid div-by-zero#16655
Conversation
CC @am17an, I think we'll need to update the fusion logic to handle this.

You can perhaps make a PR to this branch, though I don't know if it will automatically rebase on merge then? Perhaps best to make a separate PR afterwards.

Oh yeah, this will work.

Now that you can make a PR branch on top of another branch in this repo, it will rebase automatically. 🎉

Unless we have adopted 3rd-party tooling for this, it will likely not be a smooth experience: stacked PRs are poorly supported in git (and thus in GitHub), especially when combined with squash-merges.
@ORippler looks like they saw your comment https://x.com/jaredpalmer/status/1980619222918262842 |
Interesting. I read this as "we will still restack but will get acceptable run-time complexity by using git reftables", but I must say this is beyond my git knowledge 😄 |
@ggerganov gentle ping |
…l-org#16655)
* add missing norm topk bias
* use clamping instead, update number and add comment

…#16656)
* vulkan: Update topk_moe fusion to handle gpt's late softmax

  Based on ggml-org#16649.
* Add ggml_check_edges
* Add sync logging to show fusion effects
* handle clamp added in ggml-org#16655
* Update ggml/src/ggml-impl.h

Co-authored-by: Diego Devesa <slarengh@gmail.com>
I initially added the norm topk bias only for BailingMoeV2 in #16063, but it turns out it is present in all models that use norm_topk_prob; it was probably left out because there was no easy way to add the bias at the time.

Edit: Updated to use clamping, as the purpose of this bias is to avoid division by zero.

Affected models:
- DeepSeekV3
- Dots1
- Glm4Moe
Edit: Hmmm, ok, Lfm2Moe uses `1e-6`, not sure why; I assume `1e-20` was originally chosen as an insignificantly small number to offset `0.0`. @tdakhran @paulpak58 @mlabonne?

Edit 2: It gets more interesting: some models (like Ernie4_5_Moe) use clamping (to `1e-12`) to achieve the same thing.
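To illustrate the difference between the two approaches in the thread, here is a minimal Python sketch (not the actual ggml implementation; function names are hypothetical, and the epsilon values are the ones mentioned above):

```python
# Illustrative sketch only (not the ggml code): two ways to keep the
# top-k expert-weight normalization from dividing by zero when every
# selected weight is 0.0.

def normalize_with_bias(weights, eps=1e-20):
    """Add a tiny bias to the denominator (the original approach)."""
    total = sum(weights) + eps
    return [w / total for w in weights]

def normalize_with_clamp(weights, eps=1e-20):
    """Clamp the denominator from below (the approach in this PR)."""
    total = max(sum(weights), eps)
    return [w / total for w in weights]

# Normal case: both behave like plain normalization.
print(normalize_with_clamp([1.0, 3.0]))   # [0.25, 0.75]

# Degenerate all-zero case: plain normalization would raise
# ZeroDivisionError; both variants return zeros instead.
print(normalize_with_clamp([0.0, 0.0]))   # [0.0, 0.0]
print(normalize_with_bias([0.0, 0.0]))    # [0.0, 0.0]
```

With a nonzero sum the added bias (`1e-20`) is absorbed by floating-point rounding, so both variants produce essentially identical results; clamping just expresses the intent (a lower bound on the denominator) more directly.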