Thinking/Non-Thinking mode

#4
by mtcl - opened

Hey! Thank you for the quants. How do we switch between thinking and non thinking mode?

Also, which layers can we offload to GPU? I recall there was an oddity in some models where we cannot always start with the first layer when using the -ot parameter, lol. How do I even find that out? :)

@mtcl

Check out this post: https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/discussions/1#68ac68e2361af4a168095655

It has a couple of links to follow and a PR for ik_llama.cpp that might add a feature to allow setting it easily at the command line. I'm also not 100% sure whether it is just `thinking` or `enable_thinking` and need to look into it some more when I'm back at my desk later this week.

> i recall that there was an oddity in models that we cannot always start with the first layer when using the -ot parameter. lol. How do I even find that out :)

So different models have a different number of initial dense layers, e.g. `ffn_(gate|down|up)`, then usually one or no shared experts, e.g. `ffn_(gate|down|up)_shexp`, and finally all MoEs have routed experts, e.g. `ffn_(gate|down|up)_exps`.
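To make those three suffixes concrete, here is a small Python sketch that classifies tensor names the way described above. The sample names are made up to mimic the GGUF naming convention; real names come from the model file:

```python
import re

def classify(name):
    # Order matters: "_shexp" and "_exps" both contain the base "ffn_..."
    # pattern, so test the more specific suffixes first.
    if re.search(r"ffn_(gate|down|up)_shexp", name):
        return "shared expert"
    if re.search(r"ffn_(gate|down|up)_exps", name):
        return "routed experts"
    if re.search(r"ffn_(gate|down|up)", name):
        return "dense FFN"
    return "other"

# Hypothetical sample names following the GGUF convention:
for n in ["blk.0.ffn_gate.weight",        # early dense layer
          "blk.3.ffn_gate_shexp.weight",  # shared expert
          "blk.3.ffn_gate_exps.weight"]:  # routed experts
    print(n, "->", classify(n))
```

The same "most specific pattern first" ordering is why the order of -ot rules matters on the command line.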

So for DeepSeek, which has 3 initial dense layers [0-2], you start at layer 3, like this (keep in mind order matters here for regex matching):

```
-ngl 99 \
-ot "blk\.(3|4|5)\.ffn_.*=CUDA0" \
-ot "blk\.(6|7|8)\.ffn_.*=CUDA1" \
-ot exps=CPU \
```

For Kimi-K2 and some other models with only a single dense layer, you start at layer 1, e.g. `-ot "blk\.(1|2|3)\.ffn_.*=CUDA0" \` etc.
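To see why the rule order matters, here is a rough Python sketch of the first-match-wins idea, assuming the -ot patterns are tried in the order given (the rules below are the DeepSeek example from above):

```python
import re

# The -ot rules from the example above, in order; first match wins.
rules = [
    (r"blk\.(3|4|5)\.ffn_.*", "CUDA0"),
    (r"blk\.(6|7|8)\.ffn_.*", "CUDA1"),
    (r"exps", "CPU"),
]

def place(tensor_name):
    for pattern, device in rules:
        if re.search(pattern, tensor_name):
            return device
    return "default (-ngl placement)"

print(place("blk.4.ffn_gate_exps.weight"))  # CUDA0: matched before the exps=CPU rule
print(place("blk.10.ffn_up_exps.weight"))   # CPU: only the catch-all exps rule matches
```

If you put `exps=CPU` first instead, it would swallow the routed experts of layers 3-8 too, and nothing would land on the GPUs.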

You can tell by looking at the tensor names in the Hugging Face GGUF file viewer: see at which layer the shared and routed experts begin, and start there. Look here for example: https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF?show_file_info=IQ2_KT%2FDeepSeek-V3.1-IQ2_KT-00001-of-00005.gguf

I have no idea why --n-cpu-moe exists; it is some kind of shortcut, but it only confuses me as it adds more options that I personally don't need. You'd have to read the PRs to see exactly what it does, which is basically this same idea but easier for some people, I suppose.

Hope that helps; if it is only more confusing, let me know, maybe I have to make a video about it myself ;p

Catch you later this week, let me know if you figure out thinking/non-thinking! There is a lot going on with the various endpoints, OpenAI-compatible style, built-in chat completions, text completions, etc. So yeah, it is kinda confusing. Find one way that works for your given client systems and stick with it! Cheers!
