I just created a merged PR for the fast and accurate one.

#5 by gopi87 - opened

Cool, I'm getting almost 60 tok/s generation speed on an L40S 48GB with IQ4_NL:

time ./llama-cli -m /local_disk0/IQ4_NL/Qwen3_Next_80B_A3B_Instruct-IQ4_NL-00001-of-00013.gguf --no-mmap --prompt 'write paragraph on quantum computing' -st -ngl 100

....

llama_perf_sampler_print: sampling time = 14.62 ms / 232 runs ( 0.06 ms per token, 15869.76 tokens per second)
llama_perf_context_print: load time = 5073.44 ms
llama_perf_context_print: prompt eval time = 57.52 ms / 13 tokens ( 4.42 ms per token, 226.02 tokens per second)
llama_perf_context_print: eval time = 3650.05 ms / 218 runs ( 16.74 ms per token, 59.73 tokens per second)
llama_perf_context_print: total time = 3769.67 ms / 231 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (L40S) | 45596 = 1651 + (43297 = 42821 + 171 + 304) + 647 |
llama_memory_breakdown_print: | - Host | 178 = 166 + 0 + 12 |

real 0m9.417s
