Bug: Device memory allocation of size 17368219904 failed. #1289

@ai-joe-git

Description

What happened?

OK, so I wanted to give ik_llama.cpp a try. I am currently on the latest Vulkan builds of both ik_llama.cpp and mainline llama.cpp.

Here is my .bat, which works perfectly (with an even bigger quant) on regular llama.cpp:

@echo off

title GLM-4.7-Flash (FULL GPU)

echo ========================================
echo GLM-4.7-Flash - FULL GPU MODE
echo ========================================
echo.
echo Architecture: 30B MoE (~3.6B active)
echo Context: 200K tokens (FULL)
echo Mode: FULL GPU offload
echo Speed: ~15-20 tok/s expected
echo.

cd /d C:\Users\uscha\Desktop\llamaCPP

.\builds\current\bin\llama-server.exe ^
-m "C:\Users\uscha\Desktop\llamaCPP\models\unsloth\GLM-4.7-Flash-UD\GLM-4.7-Flash-UD-Q4_K_XL.gguf" ^
--port 8081 ^
--host 127.0.0.1 ^
--jinja ^
--chat-template-kwargs "{\"enable_thinking\": false}" ^
-ngl 99 ^
-c 202752 ^
-fa 1 ^
-ot ".ffn_*_exps.=CPU" ^
-t 8 ^
--parallel 1 ^
--repeat-penalty 1.0 ^
--temp 0.7 ^
--top-p 1.0 ^
--min-p 0.01 ^
--no-kv-offload ^
--no-warmup ^
-ctk q8_0 ^
-ctv q8_0

pause

echo.
echo Server: http://127.0.0.1:8081
pause
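
One aside before the logs: as far as I can tell the -ot value is treated as a regex, where "_*" means "zero or more underscores", so ".ffn_*_exps.=CPU" very likely matches no tensor at all; the load log below seems to confirm this, since essentially all of the weights end up on Vulkan0. If the goal was to keep the routed experts in system RAM, the pattern would presumably need to look more like the line below (my assumption; every run in this report used the pattern exactly as written above):

-ot "\.ffn_.*_exps\.=CPU"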

Server log from regular llama.cpp:

========================================
GLM-4.7-Flash - FULL GPU MODE

Architecture: 30B MoE (~3.6B active)
Context: 200K tokens (FULL)
Mode: FULL GPU offload
Speed: ~15-20 tok/s expected

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 140V GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
build: 8054 (01d8eaa28) with GNU 15.2.0 for Windows AMD64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

Running without SSL
init: using 7 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model 'C:\Users\uscha\Desktop\llamaCPP\models\unsloth\GLM-4.7-Flash-UD\GLM-4.7-Flash-UD-Q4_K_XL.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 17422 MiB of device memory vs. 27335 MiB of free device memory
llama_params_fit_impl: will leave 9913 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.31 seconds
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Arc(TM) 140V GPU (16GB)) (unknown id) - 27336 MiB free
llama_model_loader: loaded meta data with 59 key-value pairs and 844 tensors from C:\Users\uscha\Desktop\llamaCPP\models\unsloth\GLM-4.7-Flash-UD\GLM-4.7-Flash-UD-Q4_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 3: general.name str = Glm-4.7-Flash
llama_model_loader: - kv 4: general.basename str = Glm-4.7-Flash
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 64x2.6B
llama_model_loader: - kv 7: general.license str = mit
llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = GLM 4.7 Flash
llama_model_loader: - kv 11: general.base_model.0.organization str = Zai Org
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/zai-org/GLM-4....
llama_model_loader: - kv 13: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 14: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 15: deepseek2.block_count u32 = 47
llama_model_loader: - kv 16: deepseek2.context_length u32 = 202752
llama_model_loader: - kv 17: deepseek2.embedding_length u32 = 2048
llama_model_loader: - kv 18: deepseek2.feed_forward_length u32 = 10240
llama_model_loader: - kv 19: deepseek2.attention.head_count u32 = 20
llama_model_loader: - kv 20: deepseek2.attention.head_count_kv u32 = 1
llama_model_loader: - kv 21: deepseek2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 22: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 23: deepseek2.expert_used_count u32 = 4
llama_model_loader: - kv 24: deepseek2.expert_group_count u32 = 1
llama_model_loader: - kv 25: deepseek2.expert_group_used_count u32 = 1
llama_model_loader: - kv 26: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 27: deepseek2.leading_dense_block_count u32 = 1
llama_model_loader: - kv 28: deepseek2.vocab_size u32 = 154880
llama_model_loader: - kv 29: deepseek2.attention.q_lora_rank u32 = 768
llama_model_loader: - kv 30: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 31: deepseek2.attention.key_length u32 = 576
llama_model_loader: - kv 32: deepseek2.attention.value_length u32 = 512
llama_model_loader: - kv 33: deepseek2.attention.key_length_mla u32 = 256
llama_model_loader: - kv 34: deepseek2.attention.value_length_mla u32 = 256
llama_model_loader: - kv 35: deepseek2.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 36: deepseek2.expert_count u32 = 64
llama_model_loader: - kv 37: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 38: deepseek2.expert_weights_scale f32 = 1.800000
llama_model_loader: - kv 39: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 40: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 41: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 42: tokenizer.ggml.pre str = glm4
llama_model_loader: - kv 43: tokenizer.ggml.tokens arr[str,154880] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 44: tokenizer.ggml.token_type arr[i32,154880] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 45: tokenizer.ggml.merges arr[str,321649] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 46: tokenizer.ggml.eos_token_id u32 = 154820
llama_model_loader: - kv 47: tokenizer.ggml.padding_token_id u32 = 154821
llama_model_loader: - kv 48: tokenizer.ggml.bos_token_id u32 = 154822
llama_model_loader: - kv 49: tokenizer.ggml.eot_token_id u32 = 154827
llama_model_loader: - kv 50: tokenizer.ggml.unknown_token_id u32 = 154820
llama_model_loader: - kv 51: tokenizer.ggml.eom_token_id u32 = 154829
llama_model_loader: - kv 52: tokenizer.chat_template str = [gMASK]\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv 53: general.quantization_version u32 = 2
llama_model_loader: - kv 54: general.file_type u32 = 15
llama_model_loader: - kv 55: quantize.imatrix.file str = GLM-4.7-Flash-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv 56: quantize.imatrix.dataset str = unsloth_calibration_GLM-4.7-Flash.txt
llama_model_loader: - kv 57: quantize.imatrix.entries_count u32 = 607
llama_model_loader: - kv 58: quantize.imatrix.chunks_count u32 = 85
llama_model_loader: - type f32: 281 tensors
llama_model_loader: - type f16: 5 tensors
llama_model_loader: - type q8_0: 180 tensors
llama_model_loader: - type q4_K: 306 tensors
llama_model_loader: - type q5_K: 23 tensors
llama_model_loader: - type q6_K: 49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 16.31 GiB (4.68 BPW)
load: 0 unused tokens
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 154820 ('<|endoftext|>')
load: - 154827 ('<|user|>')
load: - 154829 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9811 MB
print_info: arch = deepseek2
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 202752
print_info: n_embd = 2048
print_info: n_embd_inp = 2048
print_info: n_layer = 47
print_info: n_head = 20
print_info: n_head_kv = 1
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 576
print_info: n_embd_head_v = 512
print_info: n_gqa = 20
print_info: n_embd_k_gqa = 576
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 10240
print_info: n_expert = 64
print_info: n_expert_used = 4
print_info: n_expert_groups = 1
print_info: n_group_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 202752
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 29.94 B
print_info: general.name = Glm-4.7-Flash
print_info: n_layer_dense_lead = 1
print_info: n_lora_q = 768
print_info: n_lora_kv = 512
print_info: n_embd_head_k_mla = 256
print_info: n_embd_head_v_mla = 256
print_info: n_ff_exp = 1536
print_info: n_expert_shared = 1
print_info: expert_weights_scale = 1.8
print_info: expert_weights_norm = 1
print_info: expert_gating_func = sigmoid
print_info: vocab type = BPE
print_info: n_vocab = 154880
print_info: n_merges = 321649
print_info: BOS token = 154822 '[gMASK]'
print_info: EOS token = 154820 '<|endoftext|>'
print_info: EOT token = 154827 '<|user|>'
print_info: EOM token = 154829 '<|observation|>'
print_info: UNK token = 154820 '<|endoftext|>'
print_info: PAD token = 154821 '[MASK]'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 154838 '<|code_prefix|>'
print_info: FIM SUF token = 154840 '<|code_suffix|>'
print_info: FIM MID token = 154839 '<|code_middle|>'
print_info: EOG token = 154820 '<|endoftext|>'
print_info: EOG token = 154827 '<|user|>'
print_info: EOG token = 154829 '<|observation|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 46 repeating layers to GPU
load_tensors: offloaded 48/48 layers to GPU
load_tensors: CPU_Mapped model buffer size = 170.16 MiB
load_tensors: Vulkan0 model buffer size = 16529.34 MiB
....................................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|user|> logit bias = -inf
common_init_result: added <|observation|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 202752
llama_context: n_ctx_seq = 202752
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: Vulkan_Host output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 5561.79 MiB
llama_kv_cache: size = 5561.79 MiB (202752 cells, 47 layers, 1/1 seqs), K (q8_0): 5561.79 MiB, V (q8_0): 0.00 MiB
sched_reserve: reserving ...
sched_reserve: Vulkan0 compute buffer size = 892.67 MiB
sched_reserve: Vulkan_Host compute buffer size = 404.01 MiB
sched_reserve: graph nodes = 3317
sched_reserve: graph splits = 96
sched_reserve: reserve took 860.70 ms, sched copies = 1
srv load_model: initializing slots, n_slots = 1
no implementations specified for speculative decoding
slot load_model: id 0 | task -1 | speculative decoding context not initialized
slot load_model: id 0 | task -1 | new slot, n_ctx = 202752
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use --cache-ram 0 to disable the prompt cache
srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '[gMASK]<|system|>You are a helpful assistant<|user|>Hello<|assistant|>Hi there<|user|>How are you?<|assistant|>'
srv init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://127.0.0.1:8081
main: starting the main loop...
srv update_slots: all slots are idle
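
As a sanity check, mainline's numbers add up: 16529.34 MiB of weights on Vulkan0 plus the 892.67 MiB Vulkan0 compute buffer is almost exactly the 17422 MiB that llama_params_fit_impl projected, and 27335 - 17422 = 9913 MiB left free, exactly as it reported. The 5561.79 MiB q8_0 K cache stays in host RAM because of --no-kv-offload.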

And with the exact same settings on ik_llama.cpp, using an ik quant (see the .bat below), it crashes:

@echo off

title GLM-4.7-Flash (FULL GPU)

echo ========================================
echo GLM-4.7-Flash - FULL GPU MODE
echo ========================================
echo.
echo Architecture: 30B MoE (~3.6B active)
echo Context: 200K tokens (FULL)
echo Mode: FULL GPU offload
echo Speed: ~15-20 tok/s expected
echo.

cd /d C:\Users\uscha\Desktop\ik_llamaCPP

.\builds\current\bin\llama-server.exe ^
-m "C:\Users\uscha\Desktop\ik_llamaCPP\models\GLM-4.7-Flash-smol-IQ4_KSS.gguf" ^
--port 8081 ^
--host 127.0.0.1 ^
--jinja ^
--chat-template-kwargs "{\"enable_thinking\": false}" ^
-ngl 99 ^
-c 202752 ^
-fa 1 ^
-ot ".ffn_*_exps.=CPU" ^
-t 8 ^
--parallel 1 ^
--repeat-penalty 1.0 ^
--temp 0.7 ^
--top-p 1.0 ^
--min-p 0.01 ^
--no-kv-offload ^
--no-warmup ^
-ctk q8_0 ^
-ctv q8_0

pause

echo.
echo Server: http://127.0.0.1:8081
pause

ik_llama.cpp server log:

========================================
GLM-4.7-Flash - FULL GPU MODE

Architecture: 30B MoE (~3.6B active)
Context: 200K tokens (FULL)
Mode: FULL GPU offload
Speed: ~15-20 tok/s expected

ggml_vulkan: 0 = Intel(R) Arc(TM) 140V GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
INFO [ main] build info | tid="1" timestamp=1771526337 build=4209 commit="b855bf92"
INFO [ main] system info | tid="1" timestamp=1771526337 n_threads=8 n_threads_batch=-1 total_threads=8 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
Vulkan0: using device Vulkan0 - 28104 MiB free
llama_model_loader: loaded meta data with 51 key-value pairs and 844 tensors from C:\Users\uscha\Desktop\ik_llamaCPP\models\GLM-4.7-Flash-smol-IQ4_KSS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 3: general.name str = GLM 4.7 Flash
llama_model_loader: - kv 4: general.size_label str = 64x2.6B
llama_model_loader: - kv 5: general.license str = mit
llama_model_loader: - kv 6: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 7: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 8: deepseek2.block_count u32 = 47
llama_model_loader: - kv 9: deepseek2.context_length u32 = 202752
llama_model_loader: - kv 10: deepseek2.embedding_length u32 = 2048
llama_model_loader: - kv 11: deepseek2.feed_forward_length u32 = 10240
llama_model_loader: - kv 12: deepseek2.attention.head_count u32 = 20
llama_model_loader: - kv 13: deepseek2.attention.head_count_kv u32 = 1
llama_model_loader: - kv 14: deepseek2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 15: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: deepseek2.expert_used_count u32 = 4
llama_model_loader: - kv 17: deepseek2.expert_group_count u32 = 1
llama_model_loader: - kv 18: deepseek2.expert_group_used_count u32 = 1
llama_model_loader: - kv 19: general.file_type u32 = 148
llama_model_loader: - kv 20: deepseek2.leading_dense_block_count u32 = 1
llama_model_loader: - kv 21: deepseek2.vocab_size u32 = 154880
llama_model_loader: - kv 22: deepseek2.attention.q_lora_rank u32 = 768
llama_model_loader: - kv 23: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 24: deepseek2.attention.key_length u32 = 576
llama_model_loader: - kv 25: deepseek2.attention.value_length u32 = 512
llama_model_loader: - kv 26: deepseek2.attention.key_length_mla u32 = 256
llama_model_loader: - kv 27: deepseek2.attention.value_length_mla u32 = 256
llama_model_loader: - kv 28: deepseek2.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 29: deepseek2.expert_count u32 = 64
llama_model_loader: - kv 30: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 31: deepseek2.expert_weights_scale f32 = 1.800000
llama_model_loader: - kv 32: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 33: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = glm4
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,154880] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,154880] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,321649] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 154820
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 154820
llama_model_loader: - kv 42: tokenizer.ggml.bos_token_id u32 = 154822
llama_model_loader: - kv 43: tokenizer.ggml.eot_token_id u32 = 154827
llama_model_loader: - kv 44: tokenizer.ggml.unknown_token_id u32 = 154820
llama_model_loader: - kv 45: tokenizer.ggml.eom_token_id u32 = 154829
llama_model_loader: - kv 46: tokenizer.chat_template str = [gMASK]\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv 47: quantize.imatrix.file str = /mnt/raid/models/ubergarm/GLM-4.7-Fla...
llama_model_loader: - kv 48: quantize.imatrix.dataset str = ubergarm-imatrix-calibration-corpus-v...
llama_model_loader: - kv 49: quantize.imatrix.entries_count i32 = 608
llama_model_loader: - kv 50: quantize.imatrix.chunks_count i32 = 813
llama_model_loader: - type f32: 281 tensors
llama_model_loader: - type q8_0: 420 tensors
llama_model_loader: - type iq4_k: 1 tensors
llama_model_loader: - type iq6_k: 4 tensors
llama_model_loader: - type iq4_kss: 138 tensors
================= Adjusted mainline llama.cpp MLA tensors to ik_llama.cpp
================= Missing experts gating function -> set to sigmoid
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 154820 ('<|endoftext|>')
load: - 154827 ('<|user|>')
load: - 154829 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9811 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: n_ctx_train = 202752
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 47
llm_load_print_meta: n_head = 20
llm_load_print_meta: n_head_kv = 20
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 10240
llm_load_print_meta: n_expert = 64
llm_load_print_meta: n_expert_used = 4
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 202752
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 30B.A3B
llm_load_print_meta: model ftype = IQ4_KSS - 4.0 bpw
llm_load_print_meta: model params = 29.943 B
llm_load_print_meta: model size = 14.918 GiB (4.280 BPW)
llm_load_print_meta: repeating layers = 14.507 GiB (4.252 BPW, 29.309 B parameters)
llm_load_print_meta: general.name = GLM 4.7 Flash
llm_load_print_meta: n_layer_dense_lead = 1
llm_load_print_meta: n_lora_q = 768
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 1536
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 1.8
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.0000
print_info: vocab type = BPE
print_info: n_vocab = 154880
print_info: n_merges = 321649
print_info: BOS token = 154822 '[gMASK]'
print_info: EOS token = 154820 '<|endoftext|>'
print_info: EOT token = 154827 '<|user|>'
print_info: EOM token = 154829 '<|observation|>'
print_info: UNK token = 154820 '<|endoftext|>'
print_info: PAD token = 154820 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 154838 '<|code_prefix|>'
print_info: FIM SUF token = 154840 '<|code_suffix|>'
print_info: FIM MID token = 154839 '<|code_middle|>'
print_info: EOG token = 154820 '<|endoftext|>'
print_info: EOG token = 154827 '<|user|>'
print_info: EOG token = 154829 '<|observation|>'
print_info: max token length = 1024
llm_load_tensors: ggml ctx size = 0.69 MiB
llm_load_tensors: offloading 47 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 48/48 layers to GPU
llm_load_tensors: Vulkan0 buffer size = 15105.76 MiB
llm_load_tensors: CPU buffer size = 170.16 MiB
...................................................................................................
============ llm_prepare_mla: need to compute 47 wkv_b tensors
Computed blk.0.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.1.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.2.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.3.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.4.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.5.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.6.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.7.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.8.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.9.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.10.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.11.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.12.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.13.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.14.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.15.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.16.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.17.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.18.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.19.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.20.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.21.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.22.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.23.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.24.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.25.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.26.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.27.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.28.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.29.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.30.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.31.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.32.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.33.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.34.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.35.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.36.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.37.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.38.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.39.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.40.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.41.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.42.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.43.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.44.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.45.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
Computed blk.46.attn_kv_b.weight as 512 x 8960 and stored in buffer Vulkan0
===================================== llama_init_from_model: f16
llama_init_from_model: n_ctx = 202752
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 1
llama_init_from_model: mla_attn = 3
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 5561.79 MiB
llama_init_from_model: KV self size = 5561.79 MiB, c^KV (q8_0): 5561.79 MiB, kv^T: not used
llama_init_from_model: CPU output buffer size = 0.59 MiB
ggml_vulkan: Device memory allocation of size 17368219904 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 17368219904
llama_init_from_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model 'C:\Users\uscha\Desktop\ik_llamaCPP\models\GLM-4.7-Flash-smol-IQ4_KSS.gguf'
ERR [ load_model] unable to load model | tid="1" timestamp=1771526370 model="C:\Users\uscha\Desktop\ik_llamaCPP\models\GLM-4.7-Flash-smol-IQ4_KSS.gguf"
Press any key to continue...
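
For scale, the failed request of 17368219904 bytes is about 16563 MiB (~16.2 GiB) asked for as a single Vulkan0 compute buffer, roughly 18x the 892.67 MiB that mainline reserved for the exact same 202752-token context, and bigger than the GPU's 16 GB (and, per the ErrorOutOfDeviceMemory line, apparently above the driver's per-allocation limit as well). In case it helps triage, the variant below is a quick way to check whether the request scales with context and micro-batch size; just a sketch on my side (-c and -ub are the standard llama.cpp flags, everything else unchanged from the failing .bat), not a fix:

@echo off
rem Hypothetical shrink test: identical to the failing command, but with a much
rem smaller context (-c 32768) and micro-batch (-ub 256), to see whether the
rem Vulkan0 compute buffer request drops accordingly.
cd /d C:\Users\uscha\Desktop\ik_llamaCPP
.\builds\current\bin\llama-server.exe ^
-m "C:\Users\uscha\Desktop\ik_llamaCPP\models\GLM-4.7-Flash-smol-IQ4_KSS.gguf" ^
--port 8081 ^
--host 127.0.0.1 ^
-ngl 99 ^
-c 32768 ^
-ub 256 ^
-fa 1 ^
-t 8 ^
--no-kv-offload ^
--no-warmup ^
-ctk q8_0 ^
-ctv q8_0
pause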

Name and Version

build-b855bf92

What operating system are you seeing the problem on?

Windows

Labels: wontfix (This will not be worked on)