testing
W790E Sage + QYFS + 512G + RTX5090
IQ5_K:
Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.attn_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.attn_q.weight (size = 66846720 bytes) -- ignoring
model has unused tensor blk.92.attn_k.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_v.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.92.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_output.weight (size = 66846720 bytes) -- ignoring
model has unused tensor blk.92.attn_q_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.attn_k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.post_attention_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_inp.weight (size = 3276800 bytes) -- ignoring
model has unused tensor blk.92.exp_probs_b.bias (size = 640 bytes) -- ignoring
Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_gate_exps.weight (size = 865075200 bytes) -- ignoring
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_down_exps.weight (size = 1042022400 bytes) -- ignoring
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_up_exps.weight (size = 865075200 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.nextn.eh_proj.weight (size = 55705600 bytes) -- ignoring
model has unused tensor blk.92.nextn.enorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.hnorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_norm.weight (size = 20480 bytes) -- ignoring
llm_load_tensors: offloading 93 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 94/94 layers to GPU
llm_load_tensors: CPU buffer size = 41969.77 MiB
llm_load_tensors: CPU buffer size = 43128.77 MiB
llm_load_tensors: CPU buffer size = 43785.02 MiB
llm_load_tensors: CPU buffer size = 43128.77 MiB
llm_load_tensors: CPU buffer size = 43785.02 MiB
llm_load_tensors: CPU buffer size = 34155.58 MiB
llm_load_tensors: CPU buffer size = 612.81 MiB
llm_load_tensors: CUDA0 buffer size = 16308.63 MiB
....................................................................................................
MLA is only available for LLM_ARCH_DEEPSEEK2 -> turning off MLA
llama_new_context_with_model: n_ctx = 71168
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 10503.22 MiB
llama_new_context_with_model: KV self size = 10503.19 MiB, K (q6_0): 5251.59 MiB, V (q6_0): 5251.59 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 3237.77 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1192.05 MiB
llama_new_context_with_model: graph nodes = 4273
llama_new_context_with_model: graph splits = 180
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op OUT_PROD to OFF
main: n_kv_max = 71168, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 27.031 | 151.53 | 95.807 | 10.69 |
| 4096 | 1024 | 4096 | 27.326 | 149.90 | 114.911 | 8.91 |
| 4096 | 1024 | 8192 | 27.779 | 147.45 | 109.264 | 9.37 |
| 4096 | 1024 | 12288 | 27.695 | 147.90 | 117.012 | 8.75 |
| 4096 | 1024 | 16384 | 28.322 | 144.62 | 124.412 | 8.23 |
-ngl 99 used to force all layers to GPU. Is this using that new -moe parameter?
I still use the original method (and always forget the n-moe flag or whatever), e.g.
...
-ngl 99 \
-ot exps=CPU \
...
I have a write-up somewhere explaining it, but basically it says to offload everything to the GPU - and then follows up with "just kidding, override the routed experts to CPU/RAM".
That way you keep the attn/shexp/first N dense layers on the GPU with the KV cache, and only the sparse routed experts sit in slower RAM.
You can offload additional routed experts to VRAM as well, but it is not so efficient given only 8 routed experts are active per token.
If you can offload the whole thing, then that is gonna be faster hah.
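The override can be pictured as a regex search over tensor names: anything matching the pattern gets the named buffer type, everything else follows -ngl onto the GPU. This is a simplified sketch of the idea, not ik_llama.cpp's actual matcher:

```python
import re

# Illustrative sketch of -ot / --override-tensor matching (assumed
# behavior: pattern is searched against each tensor name; first match
# wins, unmatched tensors follow -ngl onto the GPU).
def buffer_for(tensor_name: str, overrides=(("exps", "CPU"),)) -> str:
    for pattern, buffer_type in overrides:
        if re.search(pattern, tensor_name):
            return buffer_type
    return "GPU"

# The big routed-expert tensors (the ~1 GiB ffn_*_exps in the log) match:
print(buffer_for("blk.92.ffn_up_exps.weight"))    # CPU
# Attention and shared-expert tensors do not, so they stay on the GPU:
print(buffer_for("blk.92.attn_q.weight"))         # GPU
print(buffer_for("blk.92.ffn_up_shexp.weight"))   # GPU
```

Note that "exps" matches only the routed-expert tensors (ffn_gate_exps, ffn_down_exps, ffn_up_exps) and not the shared experts (ffn_*_shexp), which is exactly why the shexp weights stay in VRAM with the KV cache.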