Crash with NaN

#5 opened by Wonderful-Cat

I'm getting occasional crashes with DeepSeek-V3.1-IQ2_KL on recent builds of ik_llama. This has been happening for the past couple of days across several builds, so it isn't specific to the current latest one. The error message is always:
Oops(ggml_compute_forward_sum_rows_f32, ffn_moe_weights_sum-60): found -nan for i1 = 0, i2 = 0, i3 = 0. ne00 = 8
I'm not sure whether this is an issue with the quant or a bug in ik_llama.

./llama-server --chat-template deepseek3 -ot "ffn_down.weight=CUDA0" -ot "ffn_up.weight=CUDA0" -ot "ffn_gate.weight=CUDA0" --ctx-size 40000 -mla 3 -fa -amb 512 -b 4096 -ub 4096 -fmoe --parallel 1 --threads 12

I'll take a look. Before releasing, I run each quant through a full llama-perplexity run with the CPU-only backend and also do a couple of vibe tests with llama-server.
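For context, that perplexity pass looks roughly like this. This is just a sketch: the model path is a placeholder for your local file, and wiki.test.raw stands in for whatever test corpus you have on hand.

```
# CPU-only perplexity sanity check (paths are placeholders, not the actual
# files used for the release checks).
./llama-perplexity \
  -m /models/DeepSeek-V3.1-IQ2_KL.gguf \
  -f wiki.test.raw \
  --threads 12
```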

ik just gave us two new features for sanitizing the imatrix during quantization, which can now fix NaN problems before they happen, but this only landed yesterday: https://github.com/ikawrakow/ik_llama.cpp/pull/735

ik also gave us a new --validate-quants feature to check that a model is good before attempting to run it: https://github.com/ikawrakow/ik_llama.cpp/pull/727

None of my quants should have NaNs in them according to --validate-quants, but please let me know if you find anything with it.
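If you want to re-check your copy, the invocation is roughly the following, assuming the flag is accepted by the same binary you already run; the model path is a placeholder.

```
# Load the model with quant validation enabled; --validate-quants is the flag
# from PR 727 above. Replace the placeholder path with your local file.
./llama-server --model /models/DeepSeek-V3.1-IQ2_KL.gguf --validate-quants
```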

I'll check the one you're mentioning now myself and report back with what I find.

Also, here are some recommendations for your command:

Your -ot ... regexes are really unusual; I'm not sure where you got them. Look at the model card for this and some of the other models for tips. You also don't have -ngl, so something is strange with what you are doing. Consider the updated example below, but let me know your CPU/RAM/GPU(s) and operating system if you want more detailed help. You also aren't specifying your model; maybe you didn't provide the full command?

Okay, I just tested the DeepSeek-V3.1-IQ2_KL quant and it does not report any NaNs here.

Possible options:

  1. Confirm that the sha256sum of each file matches the values listed on the Hugging Face page, as people often end up with a corrupt download (see the checksum sketch below).
  2. If you're on really old GPUs like P40s, maybe that is the problem.
  3. Your -ot arguments are strange and could cause issues; consider baseline normal usage like this:
-ngl 99 \
-ot exps=CPU \

Then add extra offload to CUDA as desired for routed exps.
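Putting that together with the rest of your flags, a sketch of what I mean looks like the following; the model path is a placeholder for your local file, and you can tune the -ot overrides to your VRAM.

```
# Sketch of a baseline: all layers offloaded via -ngl 99, routed experts kept
# on CPU via -ot exps=CPU; model path is a placeholder.
./llama-server \
  --model /models/DeepSeek-V3.1-IQ2_KL.gguf \
  --chat-template deepseek3 \
  -ngl 99 \
  -ot exps=CPU \
  --ctx-size 40000 \
  -mla 3 -fa -amb 512 \
  -b 4096 -ub 4096 \
  -fmoe --parallel 1 --threads 12
```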
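And for point 1, verifying the download is just a checksum comparison; the glob here is a placeholder for your local shards.

```
# Compare local checksums against the values shown on the Hugging Face page.
sha256sum /models/DeepSeek-V3.1-IQ2_KL*.gguf
```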

I ended up completely reinstalling the system. As far as I can tell, the issue was with the build environment. I had run with --validate-quants without finding any errors and had also compared the checksums, so the model file wasn't corrupted.
In any case, I haven't encountered the error since the reinstall, without changing the command I use to run ik_llama or replacing the model file.
Thank you for your help.
