Force FP32 compute in GLM4 FFN Down #13101
Conversation
RDNA3/GFX1100, and this is the first time I've ever seen actual sentences come out of a GLM 4 model. While the text-gen delta is likewise non-existent, for me the prompt processing delta is absolutely horrendous: 230 t/s on this PR vs 1016 t/s on master for Gemma QAT. I think a better solution might just be to force-enable MMQ, which produces good outputs on GLM 4 and still has usable prompt processing at 636 t/s. I wonder if part of the terrible RDNA3 perf is the poor F32 library support described in #12372 (comment)?
JohannesGaessler left a comment
The decision of whether to use FP16 or FP32 compute should be made based on what is set using `ggml_mul_mat_set_prec`.
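That API is real in ggml; a minimal sketch of how it is used, assuming an existing context and tensors (`ctx`, `w`, `x` are placeholders, not names from this PR):

```c
// Sketch: tag a matmul node so the backend accumulates it in FP32.
// ctx, w (weights) and x (activations) are assumed to already exist.
struct ggml_tensor * node = ggml_mul_mat(ctx, w, x);
ggml_mul_mat_set_prec(node, GGML_PREC_F32); // override the FP16 default for this node only
```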
Oh, interesting. I assume that's what `ggml_mul_mat_set_prec` controls. So theoretically, if I find which tensor/op is causing issues, I can add a case for GLM4 to force FP32 for that specific tensor with `ggml_mul_mat_set_prec`?
Not quite; the current logic around which precision to use is not very good. As it is, the code will convert …
This reverts commit 6efd872.
Okay, I'm a bit sleep-deprived, but the overflow happens inside the FFN. Should I:
JohannesGaessler left a comment
I think this is the correct solution for a hotfix; the top priority should be to get correct results even if the performance is suboptimal. The performance can then be improved in subsequent PRs.
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
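For context, a minimal sketch of the shape of that hotfix, assuming ggml's C API; the function and parameter names here are illustrative, not quoted from this PR's diff:

```c
#include "ggml.h"
#include <stdbool.h>

// is_glm4: whether the current model architecture is GLM4 (in llama.cpp
// this would be a check along the lines of model.arch == LLM_ARCH_GLM4).
static struct ggml_tensor * build_ffn_down(
        struct ggml_context * ctx,
        struct ggml_tensor  * ffn_down, // down-projection weight
        struct ggml_tensor  * cur,      // FFN hidden activations
        bool                  is_glm4) {
    cur = ggml_mul_mat(ctx, ffn_down, cur);
    if (is_glm4) {
        // GLM4's FFN down projection produces values large enough to
        // overflow FP16 accumulation, so force FP32 for this matmul only.
        ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
    }
    return cur;
}
```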
@Beinsezii Could you re-test on AMD for correctness / performance impact if you have the time? I've done a few tests and this seems correct on my end, even with long-ish context input / multi-turn.
I'm merging this for now; if more fixes are needed, those can be done in separate PRs.
While this PR fixes the issue with my AMD 7900 XTX under ROCm, I am still getting the repeating characters with the same 7900 XTX under the Vulkan backend. The issue does differ slightly from the one I previously experienced under ROCm:
They do share similarities:
Hmm, I don't have a working Vulkan build setup, so I'm not sure I can help troubleshoot that part. I think the reason prompts under 64 tokens still worked, and the breakage only appeared on the second turn or with longer prompts on Volta/ROCm, is that it was overflowing during prompt processing. The Vulkan issue almost sounds like it happens during generation instead, but I'm mostly just guessing.
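A tiny self-contained illustration of the overflow mode being discussed, assuming a compiler with the `_Float16` extension (recent GCC or Clang on x86-64): FP16's largest finite value is 65504, so a long run of partial sums blows up to inf while the same sum in FP32 is fine.

```c
#include <stdio.h>

int main(void) {
    _Float16 acc16 = 0; // FP16 accumulator; max finite value is 65504
    float    acc32 = 0; // FP32 accumulator
    for (int i = 0; i < 4096; i++) {
        acc16 += (_Float16) 32.0f; // partial sums pass 65504 -> +inf
        acc32 += 32.0f;            // 4096 * 32 = 131072, fine in FP32
    }
    printf("FP16 sum: %f\n", (double) acc16); // prints inf
    printf("FP32 sum: %f\n", (double) acc32); // prints 131072.000000
    return 0;
}
```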
* Force FP32 compute in cuBLAS GEMM

* Revert "Force FP32 compute in cuBLAS GEMM"

This reverts commit 6efd872.

* Force F32 compute in GLM4 ffn down

* Edit comment to clarify issue

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
See discussion in #12946: the new GLM4 models seem to have issues with FP16 matmul. Should be reproducible with `GGML_CUDA_FORCE_CUBLAS=1` on any card that supports FP16. Based on the comment for `GGML_CUDA_FORCE_MMQ` in docs/build.md, this would most likely affect V100, CDNA and RDNA3+, though I only have a Volta card to test on. Unsure if anything else ever hits this specific branch of the code.

Seems like it largely affects prompt processing speed (compute bound), but unsure if there's a better way to fix this specific code path other than adding more checks.
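For readers unfamiliar with this code path, a hedged sketch (not the actual ggml-cuda code; the helper name `gemm_f16` is mine) of the cuBLAS call shape where the compute type decides between FP16 and FP32 accumulation:

```c
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <stdbool.h>

// Computes C = A^T * B with FP16 inputs. force_f32_compute selects the
// accumulation precision, which is the knob this PR is about.
cublasStatus_t gemm_f16(cublasHandle_t handle,
                        const half * A, const half * B, void * C,
                        int m, int n, int k, bool force_f32_compute) {
    if (force_f32_compute) {
        // FP16 inputs, FP32 accumulators and output: avoids the overflow,
        // at a cost on GPUs with weak FP32 tensor-op support.
        const float alpha = 1.0f, beta = 0.0f;
        return cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                            &alpha, A, CUDA_R_16F, k, B, CUDA_R_16F, k,
                            &beta,  C, CUDA_R_32F, m,
                            CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    } else {
        // Everything in FP16, including the accumulators: fast, but partial
        // sums above 65504 overflow to inf, the GLM4 failure mode here.
        const half alpha = __float2half(1.0f), beta = __float2half(0.0f);
        return cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                            &alpha, A, CUDA_R_16F, k, B, CUDA_R_16F, k,
                            &beta,  C, CUDA_R_16F, m,
                            CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
    }
}
```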
Master:
PR:
(Also ran `test-backend-ops`, seems ok)