Eval bug: Multimodality broken in b8091 onwards #19735

@markqvist

Name and Version

version: 8091 (238856e)
built with GNU 11.4.0 for Linux x86_64

Operating systems

Linux

GGML backends

Vulkan

Hardware

Radeon 8060S

Models

Qwen3 VL
GLM 4.6v

Problem description & steps to reproduce

On at least Vulkan + AMD GFX1151, image encoding has been broken from release b8091 onwards for every multimodal model I've been able to test:

  • All variants of Qwen3 VL
  • GLM 4.6v and GLM 4.6v Flash

Rolling back to b8089 solves the problem.
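
For anyone trying to reproduce this without the web UI, a request along these lines should exercise the same endpoint. This is only a sketch: the report does not say which client sent the request, the helper names `build_chat_payload` and `send` are made up for illustration, and it assumes the server accepts OpenAI-style `image_url` content parts carrying a base64 data URI. The address is the one shown in the logs.

```python
import base64
import json
import urllib.request


def build_chat_payload(image_bytes: bytes, prompt: str, mime: str = "image/png") -> dict:
    """Build an OpenAI-style chat request body with one inline image (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_uri}},
                ],
            }
        ],
    }


def send(payload: dict, url: str = "http://127.0.0.1:5815/v1/chat/completions") -> dict:
    """POST the payload as JSON and return the decoded response (hypothetical helper)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Sending the same payload to a b8089 and a b8091 server and comparing the two server logs should be enough to tell whether the image reached the encoder.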

Expected Behaviour: Correct image slicing and encoding (on b8089)

llama-server log
load_backend: loaded RPC backend from bin.8089/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from bin.8089/libggml-vulkan.so
load_backend: loaded CPU backend from bin.8089/libggml-cpu-zen4.so
build: 8089 (d0061be83) with GNU 11.4.0 for Linux x86_64
system info: n_threads = 36, n_threads_batch = 36, total_threads = 32

system_info: n_threads = 36 (n_threads_batch = 36) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

Running without SSL
init: using 31 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/storage/1/models/gguf/qwen3vl/instruct/Qwen3-VL-30B-A3B-Instruct-UD-Q8_K_XL.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 52769 MiB of device memory vs. 97567 MiB of free device memory
llama_params_fit_impl: will leave 44797 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.15 seconds
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) (0000:c6:00.0) - 97568 MiB free
llama_model_loader: loaded meta data with 45 key-value pairs and 579 tensors from /storage/1/models/gguf/qwen3vl/instruct/Qwen3-VL-30B-A3B-Instruct-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3vlmoe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-Vl-30B-A3B-Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen3-Vl-30B-A3B-Instruct
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Qwen3 VL 30B A3B Instruct
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
llama_model_loader: - kv  13:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv  14:                     qwen3vlmoe.block_count u32              = 48
llama_model_loader: - kv  15:                  qwen3vlmoe.context_length u32              = 262144
llama_model_loader: - kv  16:                qwen3vlmoe.embedding_length u32              = 2048
llama_model_loader: - kv  17:             qwen3vlmoe.feed_forward_length u32              = 6144
llama_model_loader: - kv  18:            qwen3vlmoe.attention.head_count u32              = 32
llama_model_loader: - kv  19:         qwen3vlmoe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  20:                  qwen3vlmoe.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  21: qwen3vlmoe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:               qwen3vlmoe.expert_used_count u32              = 8
llama_model_loader: - kv  23:            qwen3vlmoe.attention.key_length u32              = 128
llama_model_loader: - kv  24:          qwen3vlmoe.attention.value_length u32              = 128
llama_model_loader: - kv  25:                    qwen3vlmoe.expert_count u32              = 128
llama_model_loader: - kv  26:      qwen3vlmoe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  27:         qwen3vlmoe.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
llama_model_loader: - kv  28:              qwen3vlmoe.n_deepstack_layers u32              = 3
llama_model_loader: - kv  29:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  31:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  32:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  33:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  34:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  36:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  37:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  38:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - kv  40:                          general.file_type u32              = 7
llama_model_loader: - kv  41:                      quantize.imatrix.file str              = Qwen3-VL-30B-A3B-Instruct-GGUF/imatri...
llama_model_loader: - kv  42:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-VL-30B-A3B-...
llama_model_loader: - kv  43:             quantize.imatrix.entries_count u32              = 384
llama_model_loader: - kv  44:              quantize.imatrix.chunks_count u32              = 154
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type  f16:   75 tensors
llama_model_loader: - type q8_0:  263 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 33.51 GiB (9.43 BPW) 
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch                  = qwen3vlmoe
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 2048
print_info: n_embd_inp            = 8192
print_info: n_layer               = 48
print_info: n_head                = 32
print_info: n_head_kv             = 4
print_info: n_rot                 = 128
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 128
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = 8
print_info: n_embd_k_gqa          = 512
print_info: n_embd_v_gqa          = 512
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 6144
print_info: n_expert              = 128
print_info: n_expert_used         = 8
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 5000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [24, 20, 20, 0]
print_info: model type            = 30B.A3B
print_info: model params          = 30.53 B
print_info: general.name          = Qwen3-Vl-30B-A3B-Instruct
print_info: n_ff_exp              = 768
print_info: vocab type            = BPE
print_info: n_vocab               = 151936
print_info: n_merges              = 151387
print_info: BOS token             = 151643 '<|endoftext|>'
print_info: EOS token             = 151645 '<|im_end|>'
print_info: EOT token             = 151645 '<|im_end|>'
print_info: PAD token             = 151654 '<|vision_pad|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 151659 '<|fim_prefix|>'
print_info: FIM SUF token         = 151661 '<|fim_suffix|>'
print_info: FIM MID token         = 151660 '<|fim_middle|>'
print_info: FIM PAD token         = 151662 '<|fim_pad|>'
print_info: FIM REP token         = 151663 '<|repo_name|>'
print_info: FIM SEP token         = 151664 '<|file_sep|>'
print_info: EOG token             = 151643 '<|endoftext|>'
print_info: EOG token             = 151645 '<|im_end|>'
print_info: EOG token             = 151662 '<|fim_pad|>'
print_info: EOG token             = 151663 '<|repo_name|>'
print_info: EOG token             = 151664 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 47 repeating layers to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   593.50 MiB
load_tensors:      Vulkan0 model buffer size = 33723.49 MiB
.................................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 196608
llama_context: n_ctx_seq     = 196608
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (196608) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host  output buffer size =     0.58 MiB
llama_kv_cache:    Vulkan0 KV buffer size = 18432.00 MiB
llama_kv_cache: size = 18432.00 MiB (196608 cells,  48 layers,  1/1 seqs), K (f16): 9216.00 MiB, V (f16): 9216.00 MiB
sched_reserve: reserving ...
sched_reserve:    Vulkan0 compute buffer size =   614.02 MiB
sched_reserve: Vulkan_Host compute buffer size =   404.02 MiB
sched_reserve: graph nodes  = 3039
sched_reserve: graph splits = 2
sched_reserve: reserve took 24.07 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
clip_model_loader: model name:   Qwen3-Vl-30B-A3B-Instruct
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    352
clip_model_loader: n_kv:         31

clip_model_loader: has vision encoder
clip_ctx: CLIP using Vulkan0 backend
load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842

load_hparams: projector:          qwen3vl_merger
load_hparams: n_embd:             1152
load_hparams: n_head:             16
load_hparams: n_ff:               4304
load_hparams: n_layer:            27
load_hparams: ffn_op:             gelu
load_hparams: projection_dim:     2048

--- vision hparams ---
load_hparams: image_size:         768
load_hparams: patch_size:         16
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels:   8192
load_hparams: image_max_pixels:   4194304

load_hparams: model size:         1033.29 MiB
load_hparams: metadata size:      0.12 MiB
warmup: warmup with image size = 1472 x 1472
alloc_compute_meta:    Vulkan0 compute buffer size =   387.06 MiB
alloc_compute_meta:        CPU compute buffer size =    24.93 MiB
alloc_compute_meta: graph splits = 1, nodes = 853
warmup: flash attention is enabled
srv    load_model: loaded multimodal model, '/storage/1/models/gguf/qwen3vl/instruct/mmproj_30b-F16.gguf'
srv    load_model: initializing slots, n_slots = 1
no implementations specified for speculative decoding
slot   load_model: id  0 | task -1 | speculative decoding context not initialized
slot   load_model: id  0 | task -1 | new slot, n_ctx = 196608
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
srv          init: init: chat template, thinking = 0
main: model loaded
main: server is listening on http://127.0.0.1:5815
main: starting the main loop...
srv  update_slots: all slots are idle
srv  log_server_r: done request: GET / 127.0.0.1 200
srv  log_server_r: done request: GET / 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 2708
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 4, batch.n_tokens = 4, progress = 0.001477
slot update_slots: id  0 | task 0 | n_tokens = 4, memory_seq_rm [4, end)
srv  process_chun: processing image...
encoding image slice...
image slice encoded in 3959 ms
decoding image batch 1/2, n_tokens_batch = 2048
image decoded (batch 1/2) in 1631 ms
decoding image batch 2/2, n_tokens_batch = 650
image decoded (batch 2/2) in 1090 ms
srv  process_chun: image processed in 6680 ms
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 2708, batch.n_tokens = 6, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_tokens = 2708, batch.n_tokens = 6
slot init_sampler: id  0 | task 0 | init sampler, took 0.02 ms, tokens: text = 10, total = 2708
slot print_timing: id  0 | task 0 | 
prompt eval time =    7150.25 ms /  2708 tokens (    2.64 ms per token,   378.73 tokens per second)
       eval time =    8772.44 ms /   367 tokens (   23.90 ms per token,    41.84 tokens per second)
      total time =   15922.69 ms /  3075 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 3074, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200

Result

A correct description of the image.

Based on the image provided, here is a detailed description of the scene:

[ ... extensive description of the scene in the image ... ]

Actual Behaviour: No image slicing or encoding takes place (on b8091 onwards)

failing llama-server log
load_backend: loaded RPC backend from bin.8091/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from bin.8091/libggml-vulkan.so
load_backend: loaded CPU backend from bin.8091/libggml-cpu-zen4.so
build: 8091 (238856ec8) with GNU 11.4.0 for Linux x86_64
system info: n_threads = 36, n_threads_batch = 36, total_threads = 32

system_info: n_threads = 36 (n_threads_batch = 36) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

Running without SSL
init: using 31 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/storage/1/models/gguf/qwen3vl/instruct/Qwen3-VL-30B-A3B-Instruct-UD-Q8_K_XL.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 52769 MiB of device memory vs. 97568 MiB of free device memory
llama_params_fit_impl: will leave 44798 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.14 seconds
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) (0000:c6:00.0) - 97568 MiB free
llama_model_loader: loaded meta data with 45 key-value pairs and 579 tensors from /storage/1/models/gguf/qwen3vl/instruct/Qwen3-VL-30B-A3B-Instruct-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3vlmoe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-Vl-30B-A3B-Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen3-Vl-30B-A3B-Instruct
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Qwen3 VL 30B A3B Instruct
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
llama_model_loader: - kv  13:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv  14:                     qwen3vlmoe.block_count u32              = 48
llama_model_loader: - kv  15:                  qwen3vlmoe.context_length u32              = 262144
llama_model_loader: - kv  16:                qwen3vlmoe.embedding_length u32              = 2048
llama_model_loader: - kv  17:             qwen3vlmoe.feed_forward_length u32              = 6144
llama_model_loader: - kv  18:            qwen3vlmoe.attention.head_count u32              = 32
llama_model_loader: - kv  19:         qwen3vlmoe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  20:                  qwen3vlmoe.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  21: qwen3vlmoe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:               qwen3vlmoe.expert_used_count u32              = 8
llama_model_loader: - kv  23:            qwen3vlmoe.attention.key_length u32              = 128
llama_model_loader: - kv  24:          qwen3vlmoe.attention.value_length u32              = 128
llama_model_loader: - kv  25:                    qwen3vlmoe.expert_count u32              = 128
llama_model_loader: - kv  26:      qwen3vlmoe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  27:         qwen3vlmoe.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
llama_model_loader: - kv  28:              qwen3vlmoe.n_deepstack_layers u32              = 3
llama_model_loader: - kv  29:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  31:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  32:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  33:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  34:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  36:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  37:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  38:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - kv  40:                          general.file_type u32              = 7
llama_model_loader: - kv  41:                      quantize.imatrix.file str              = Qwen3-VL-30B-A3B-Instruct-GGUF/imatri...
llama_model_loader: - kv  42:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-VL-30B-A3B-...
llama_model_loader: - kv  43:             quantize.imatrix.entries_count u32              = 384
llama_model_loader: - kv  44:              quantize.imatrix.chunks_count u32              = 154
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type  f16:   75 tensors
llama_model_loader: - type q8_0:  263 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 33.51 GiB (9.43 BPW) 
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch                  = qwen3vlmoe
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 2048
print_info: n_embd_inp            = 8192
print_info: n_layer               = 48
print_info: n_head                = 32
print_info: n_head_kv             = 4
print_info: n_rot                 = 128
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 128
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = 8
print_info: n_embd_k_gqa          = 512
print_info: n_embd_v_gqa          = 512
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 6144
print_info: n_expert              = 128
print_info: n_expert_used         = 8
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 5000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [24, 20, 20, 0]
print_info: model type            = 30B.A3B
print_info: model params          = 30.53 B
print_info: general.name          = Qwen3-Vl-30B-A3B-Instruct
print_info: n_ff_exp              = 768
print_info: vocab type            = BPE
print_info: n_vocab               = 151936
print_info: n_merges              = 151387
print_info: BOS token             = 151643 '<|endoftext|>'
print_info: EOS token             = 151645 '<|im_end|>'
print_info: EOT token             = 151645 '<|im_end|>'
print_info: PAD token             = 151654 '<|vision_pad|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 151659 '<|fim_prefix|>'
print_info: FIM SUF token         = 151661 '<|fim_suffix|>'
print_info: FIM MID token         = 151660 '<|fim_middle|>'
print_info: FIM PAD token         = 151662 '<|fim_pad|>'
print_info: FIM REP token         = 151663 '<|repo_name|>'
print_info: FIM SEP token         = 151664 '<|file_sep|>'
print_info: EOG token             = 151643 '<|endoftext|>'
print_info: EOG token             = 151645 '<|im_end|>'
print_info: EOG token             = 151662 '<|fim_pad|>'
print_info: EOG token             = 151663 '<|repo_name|>'
print_info: EOG token             = 151664 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 47 repeating layers to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   593.50 MiB
load_tensors:      Vulkan0 model buffer size = 33723.49 MiB
.................................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 196608
llama_context: n_ctx_seq     = 196608
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (196608) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host  output buffer size =     0.58 MiB
llama_kv_cache:    Vulkan0 KV buffer size = 18432.00 MiB
llama_kv_cache: size = 18432.00 MiB (196608 cells,  48 layers,  1/1 seqs), K (f16): 9216.00 MiB, V (f16): 9216.00 MiB
sched_reserve: reserving ...
sched_reserve:    Vulkan0 compute buffer size =   614.02 MiB
sched_reserve: Vulkan_Host compute buffer size =   404.02 MiB
sched_reserve: graph nodes  = 3039
sched_reserve: graph splits = 2
sched_reserve: reserve took 21.11 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
clip_model_loader: model name:   Qwen3-Vl-30B-A3B-Instruct
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    352
clip_model_loader: n_kv:         31

clip_model_loader: has vision encoder
clip_ctx: CLIP using Vulkan0 backend
load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842

load_hparams: projector:          qwen3vl_merger
load_hparams: n_embd:             1152
load_hparams: n_head:             16
load_hparams: n_ff:               4304
load_hparams: n_layer:            27
load_hparams: ffn_op:             gelu
load_hparams: projection_dim:     2048

--- vision hparams ---
load_hparams: image_size:         768
load_hparams: patch_size:         16
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels:   8192
load_hparams: image_max_pixels:   4194304

load_hparams: model size:         1033.29 MiB
load_hparams: metadata size:      0.12 MiB
warmup: warmup with image size = 1472 x 1472
alloc_compute_meta:    Vulkan0 compute buffer size =   387.06 MiB
alloc_compute_meta:        CPU compute buffer size =    24.93 MiB
alloc_compute_meta: graph splits = 1, nodes = 853
warmup: flash attention is enabled
srv    load_model: loaded multimodal model, '/storage/1/models/gguf/qwen3vl/instruct/mmproj_30b-F16.gguf'
srv    load_model: initializing slots, n_slots = 1
no implementations specified for speculative decoding
slot   load_model: id  0 | task -1 | speculative decoding context not initialized
slot   load_model: id  0 | task -1 | new slot, n_ctx = 196608
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
srv          init: init: chat template, thinking = 0
main: model loaded
main: server is listening on http://127.0.0.1:5815
main: starting the main loop...
srv  update_slots: all slots are idle
srv  log_server_r: done request: GET / 127.0.0.1 200
srv  log_server_r: done request: GET / 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 8
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 8, batch.n_tokens = 8, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_tokens = 8, batch.n_tokens = 8
slot init_sampler: id  0 | task 0 | init sampler, took 0.01 ms, tokens: text = 8, total = 8
slot print_timing: id  0 | task 0 | 
prompt eval time =      69.02 ms /     8 tokens (    8.63 ms per token,   115.91 tokens per second)
       eval time =    2696.43 ms /   121 tokens (   22.28 ms per token,    44.87 tokens per second)
      total time =    2765.44 ms /   129 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 128, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200

Result

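Completely hallucinated output about the image. The log above shows task.n_tokens = 8 (versus 2708 on b8089) and contains no "processing image..." or "encoding image slice..." lines, so the image apparently never reaches the encoder: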

The image you've shared appears to be a screenshot of a social media post, likely from TikTok or a similar platform. The text in the image reads:

> "I can't wait for the next one"

This is a common phrase used by fans expressing excitement for upcoming content, such as new episodes, videos, or releases. Without additional context, it's difficult to determine exactly what the person is referring to—whether it's a TV show, movie, music release, or something else.

If you'd like help identifying what this might be referring to, feel free to provide more details!
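
The difference between the two logs can be checked mechanically, which may help when testing intermediate builds. The sketch below is an assumption-laden helper, not part of llama.cpp: `summarize_log` is a made-up name, and it only looks for the literal strings that appear in the logs above (`task.n_tokens = N`, `processing image...`, `image slice encoded in`), which may change between versions.

```python
import re
import sys


def summarize_log(log_text: str) -> dict:
    """Extract the markers that differ between the good and bad logs (hypothetical helper)."""
    match = re.search(r"task\.n_tokens = (\d+)", log_text)
    return {
        # Prompt size reported by the server; 2708 on b8089 vs 8 on b8091 in this report.
        "prompt_tokens": int(match.group(1)) if match else None,
        # True if the server logged that it started processing an image chunk.
        "image_processed": "processing image..." in log_text,
        # Number of image slices the vision encoder reported as finished.
        "slices_encoded": log_text.count("image slice encoded in"),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_log.py server.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        summary = summarize_log(fh.read())
    print(summary)
    # Exit 0 when the image was encoded, 1 otherwise, so the script can
    # serve as a pass/fail check (for example under `git bisect run`).
    sys.exit(0 if summary["image_processed"] and summary["slices_encoded"] > 0 else 1)
```

Run against the two logs in this report, the b8089 log would count as passing and the b8091 log as failing, since the latter contains neither marker.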

First Bad Commit

Somewhere after b8089 (d0061be83) and no later than b8091 (238856ec8)
