smol-IQ2_KS on a barely sufficient system
I had some struggles getting smol-IQ2_KS to run on my gaming rig (9800X3D, RTX 5090, 256 GB RAM). After my first unsuccessful tries I realized that my Arch Linux system used zswap for swap; since that is a compressed RAM disk, it was not very useful here ;-)
Switching back to a file-based swap was a good idea, but here too I learned that the conventional way did not work: my disk uses btrfs, so I had to use "btrfs filesystem mkswapfile ..." (see the sketch at the end of this post). Next I switched my GUI to the AMD iGPU, which freed about 900 MB of VRAM on the NVIDIA card. My start command:
/build/bin/llama-server \
    --alias ling \
    --model /home/user/LLMMODELS2/llm_gguf/ubergarm/Ling-1T-GGUF/smol-IQ2_KS/Ling-1T-smol-IQ2_KS-00001-of-00006.gguf \
    --ctx-size 32768 \
    -fa -fmoe -ger \
    -ctk q8_0 \
    -ctv q8_0 \
    -ub 4096 \
    -b 4096 \
    -ngl 99 \
    -ot "blk.(4|5).ffn_.*=CUDA0" \
    -ot exps=CPU \
    --parallel 1 \
    --threads 8 \
    --host 0.0.0.0 \
    --port 8888 \
    --no-mmap \
    --no-display-prompt
Now 30.65 GB VRAM, 247 GB RAM, and 56 GB swap are in use. I try to start only the programs I really need (Wayland/Hyprland, a shell for ik_llama, Firefox for the internal ik_llama GUI and/or SillyTavern). It barely works, but I now get 6.5 t/s. That's enough for playing around.
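For reference, the btrfs swapfile setup mentioned at the top goes roughly like this (needs a reasonably recent btrfs-progs that has the mkswapfile subcommand; the size and path here are just examples):

# create the swapfile with the btrfs helper, which sets the NoCOW/no-compression
# attributes that a swapfile on btrfs needs (path and size are examples)
sudo btrfs filesystem mkswapfile --size 64g /swap/swapfile
# enable it
sudo swapon /swap/swapfile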
Heya again, @Hansi2024!
Sweet, you're getting a taste of the big models on your rig! Great job freeing up as much RAM/VRAM as possible to load these big quants!
One suggestion I have is to avoid using any kind of swapping, to avoid excessive writes to your NVMe/SSD drive. By default ik/llama.cpp use the mmap() feature, which allows the GGUF files to remain on disk and be accessed READ ONLY at run-time, and the Linux file/page cache will juggle any weights that don't fit into available RAM.
So just by removing --no-mmap and not pre-allocating the space it should be okay. I call this the "troll rig" method for when the weights don't fit into RAM+VRAM. It will heat up an NVMe drive and can do ~5 GB/s, but at least it is read-only with no write wear!
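If you want to watch that happening, something rough like this works (assuming the sysstat package for iostat; nvme0n1 is just an example device name):

# page cache (Cached) grows as the gguf gets pulled in; the mapped model pages are
# read-only, so they never show up as dirty pages waiting to be written back
watch -n 2 'grep -E "^(Cached|Dirty):" /proc/meminfo'
# per-device stats: during inference you should see reads from the drive, not writes
iostat -x nvme0n1 2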
You may be able to run this quant with --no-mmap and fit the entire thing into RAM+VRAM, though, if you drop the batch sizes to free up some VRAM (just remove -ub 4096 -b 4096, as the default values are -ub 512 -b 2048) and try to offload one more layer, e.g. ...blk.(4|5|6)...... Also you might have to drop --ctx-size to 8192 just to test what you can fit.
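Something like this, just reusing your command from above with those tweaks (default batch sizes, one more layer offloaded, smaller context to start; whether it all actually fits is something you'd have to test):

/build/bin/llama-server \
    --alias ling \
    --model /home/user/LLMMODELS2/llm_gguf/ubergarm/Ling-1T-GGUF/smol-IQ2_KS/Ling-1T-smol-IQ2_KS-00001-of-00006.gguf \
    --ctx-size 8192 \
    -fa -fmoe -ger \
    -ctk q8_0 \
    -ctv q8_0 \
    -ngl 99 \
    -ot "blk.(4|5|6).ffn_.*=CUDA0" \
    -ot exps=CPU \
    --parallel 1 \
    --threads 8 \
    --host 0.0.0.0 \
    --port 8888 \
    --no-mmap \
    --no-display-prompt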
Anyway, keep it up and eventually you'll have a collection of commands to run big models squeezed perfectly onto your rig for max performance.
Oh, and finally: you can use llama-sweep-bench to test both PP (prompt processing, aka "prefill") and TG (token generation, aka decode) using basically the same command as your llama-server, just replace llama-server with llama-sweep-bench --warmup-batch -n 64 .....(the rest of your command)
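For example, reusing the flags from your server command (assuming llama-sweep-bench was built next to llama-server; I've left out the server-only bits like --alias/--host/--port/--parallel/--no-display-prompt, which a benchmark run shouldn't need):

/build/bin/llama-sweep-bench --warmup-batch -n 64 \
    --model /home/user/LLMMODELS2/llm_gguf/ubergarm/Ling-1T-GGUF/smol-IQ2_KS/Ling-1T-smol-IQ2_KS-00001-of-00006.gguf \
    --ctx-size 32768 \
    -fa -fmoe -ger \
    -ctk q8_0 \
    -ctv q8_0 \
    -ub 4096 \
    -b 4096 \
    -ngl 99 \
    -ot "blk.(4|5).ffn_.*=CUDA0" \
    -ot exps=CPU \
    --threads 8
# (swap things like --no-mmap or different -ub/-b values in and out to compare configs)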
Finally, I believe the most recent version of ik_llama.cpp no longer needs -fa, as it is likely on by default when possible, I think. (Hard to keep up with the changes, hah.)
cheers!
@ubergarm, thank you again. I learned a lot today, again. I didn't know there was no performance penalty when using mmap(). I just noticed that when I use --no-mmap, the initial model loading speed is very slow (0.5 GB/s as opposed to 5 GB/s). I felt it ran a little faster with --no-mmap during inference, but in hindsight, it must have been a placebo effect.
@Hansi2024
Thank you for sharing your parameters and t/s. I have a Ryzen 9600X + RTX 5070 Ti system, and I get 4 t/s running DeepSeek. If I run Ling 1T, I will probably get 3 t/s.