Unable to load with GPU layers

#10
by sambit-paul-poppulo - opened

If I attempt to run this on a CUDA device via llama.cpp:

export CUDA_VISIBLE_DEVICES=1
llama-embedding -m Qwen3-Embedding-4B-Q4_K_M.gguf -p "Who is it?<|endoftext|>" --verbose-prompt --embd-normalize 2 --gpu-layers 10 --pooling last

the embedding values returned are always a list of NaNs.

Is this expected behaviour?
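
One way to narrow this down is to rerun the same prompt entirely on CPU and compare: if the CPU run returns finite values while any non-zero --gpu-layers produces NaNs, the fault is in the CUDA offload path rather than in the GGUF file itself. A minimal sketch, reusing the command above (the model file and prompt are taken from it; nothing else is assumed):

# CPU-only baseline; should print finite embedding values
llama-embedding -m Qwen3-Embedding-4B-Q4_K_M.gguf -p "Who is it?<|endoftext|>" --pooling last --embd-normalize 2 --gpu-layers 0

# Partial GPU offload; if only this run prints nan, the CUDA path is the likely culprit
llama-embedding -m Qwen3-Embedding-4B-Q4_K_M.gguf -p "Who is it?<|endoftext|>" --pooling last --embd-normalize 2 --gpu-layers 10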

sambit-paul-poppulo changed discussion status to closed
sambit-paul-poppulo changed discussion status to open

I tried it in koboldcpp and it always ran out of memory, even with 10 GB of free VRAM. It did load CPU-only, but it was very slow on my Ryzen 9 5900X with 64 GB of DDR4. I have two RTX 3090s in the system.
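
For what it's worth, llama.cpp itself can split the model across both 3090s, which may sidestep the out-of-memory behaviour seen in koboldcpp. A sketch assuming the same model file as above; the even tensor split and the 2048-token context are illustrative values, not tested settings:

# expose both GPUs, offload all layers, split weights evenly, and cap the context
export CUDA_VISIBLE_DEVICES=0,1
llama-embedding -m Qwen3-Embedding-4B-Q4_K_M.gguf -p "Who is it?<|endoftext|>" --pooling last --embd-normalize 2 --gpu-layers 99 --tensor-split 1,1 --ctx-size 2048

A 4B model at Q4_K_M is only around 2.5 GB of weights, so the weights alone should fit comfortably on a single 3090; if memory still runs out, an oversized default context (and hence KV cache) is a more plausible cause than the model itself.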
