Qwen3-R1-SLERP-Q3T-8B

This is a merge of pre-trained language models created using mergekit.

Acknowledgements and Special Thanks

First and foremost, I want to thank everyone on the KoboldAI Discord server who helped with my testing and experimentation; none of this would have been possible without the following people.

  • Eisenstein, for their fork of LocalAIME, modified to work better with KoboldCPP and with adjusted sampler settings for Qwen/DeepSeek models, and for doing half of my testing on their machine.
  • Twistedshadows, for loaning me some of their RunPod hours to do my testing.
  • Henky as well, for also loaning me some of their RunPod hours, and for helping me troubleshoot some issues getting KCPP to work with LocalAIME.
  • Everyone else on the KoboldAI Discord server; more than a few were willing to help with advice, troubleshooting, or by offering their machines or RunPod hours for testing if the above didn't get to it first.
  • EntropyMagnets on Reddit, for making and sharing the LocalAIME tool.
  • Big thanks to https://huggingface.co/none-user for running these tests at a higher precision: https://huggingface.co/lemon07r/Qwen3-R1-SLERP-Q3T-8B/discussions/2

I would also like to thank Mradermacher and Bartowski for always posting quants of the models I upload, and of the many other models they get to as well.

GGUF Files

Merge Details

I decided to do a little experimenting with my new favorite under-10B model, DeepSeek-R1-0528-Qwen3-8B, and merge it with Qwen3-8B once I realized they were similar enough to warrant the attempt (both prefer the same sampler settings, and both were trained on Qwen3 8B Base). The R1 distill supposedly benches better and, in my own testing, is definitely the better writing model. DeepSeek had this to say in their DeepSeek-R1-0528-Qwen3-8B model card: "The model architecture of DeepSeek-R1-0528-Qwen3-8B is identical to that of Qwen3-8B, but it shares the same tokenizer configuration as DeepSeek-R1-0528." That is what made this experiment possible, and of interest to me. Being fully trained models from the same base, rather than just finetunes, and both very good quality, they seemed like excellent candidates for a SLERP merge. Upon further investigation, I found the DeepSeek and Qwen tokenizers have virtually 100% vocabulary overlap, making them pretty much interchangeable, and making models trained with either one perfect candidates for testing the two tokenizers against each other.
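The vocabulary-overlap claim above is easy to check yourself. This is a minimal sketch of how one might measure it; the toy vocab dicts here are placeholders, and in practice you would load the real vocabularies, e.g. with transformers' `AutoTokenizer.from_pretrained(...).get_vocab()` for each model:

```python
# Sketch of a vocab-overlap check between two tokenizers.
# Real use (assumption, not run here):
#   from transformers import AutoTokenizer
#   vocab_a = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").get_vocab()
#   vocab_b = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-0528-Qwen3-8B").get_vocab()

def vocab_overlap(vocab_a: dict, vocab_b: dict) -> float:
    """Fraction of the smaller vocab's tokens that also appear in the other vocab."""
    a, b = set(vocab_a), set(vocab_b)
    return len(a & b) / min(len(a), len(b))

# Tiny illustrative vocabs (token -> id); ids may differ, only token strings matter here.
qwen_like = {"hello": 0, "world": 1, "<think>": 2, "foo": 3}
deepseek_like = {"hello": 5, "world": 9, "<think>": 7}

print(f"{vocab_overlap(qwen_like, deepseek_like):.0%}")  # prints "100%"
```

Note that overlapping token strings can still map to different ids, which is why the merges inherit whichever tokenizer belongs to the chosen base model.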

I decided to stick with SLERP for this 50/50 merge because, in the long time I've spent merging models, I've found SLERP merges to be superior to other kinds most of the time (although there have been very good merges of other types). Someone else did a similar merge, but their configuration was botched and missing a layer in the layer_range, so their model is short that layer, or about 0.2B parameters according to HF.
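For reference, mergekit's layer_range is end-exclusive, so covering all 36 decoder layers of Qwen3-8B requires [0, 36], as in the configuration further down. The botched config isn't reproduced here, but a hypothetical off-by-one like the following fragment would silently drop the final layer:

```yaml
# Hypothetical off-by-one: [0, 35] covers layers 0-34 only,
# leaving the merge one decoder layer (~0.2B params) short.
slices:
  - sources:
      - model: Qwen/Qwen3-8B
        layer_range: [0, 35]  # should be [0, 36]
```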

Born of this experiment are two models, Qwen3-R1-SLERP-Q3T-8B and Qwen3-R1-SLERP-DST-8B. Both are 50/50 SLERP merges of the same parent models, DeepSeek-R1-0528-Qwen3-8B and Qwen3-8B. The difference: Q3T uses Qwen3-8B as the base model and inherits its tokenizer (the Qwen tokenizer), while DST uses DeepSeek-R1-0528-Qwen3-8B as the base and inherits the DeepSeek tokenizer.

I was interested in testing these two tokenizers against each other, since DeepSeek seemed proud enough of their tokenizer to use it over the Qwen tokenizer in the Qwen3-based R1 distill. The Qwen tokenizer is actually larger, and a few people told me that means it is more optimized; however, I'm not sure how true this is and wasn't able to find anything concrete on it. I was also told there shouldn't be much of a difference and both should be good, so much to my surprise, and everyone else's, there was a pretty noticeable difference in our testing. The Qwen tokenizer performed much better, and used far fewer tokens to get there. As a side note, Eisenstein ran a script to check for repetitiveness and noted both Qwen and DeepSeek were very repetitive, but the repetition didn't seem to have any bearing on correctness, since Qwen was still correct more often than DeepSeek. This data is available below in the results GitHub repo, along with my results and all the raw data.
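Eisenstein's exact repetitiveness script isn't reproduced here, but a minimal repetition metric along the same lines, the fraction of word n-grams that occur more than once in a generation, might look like this sketch:

```python
from collections import Counter

def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of word n-grams in `text` that occur more than once."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# A maximally repetitive string scores 1.0; unique text scores 0.0.
print(repeated_ngram_fraction("the cat sat the cat sat the cat sat"))  # prints 1.0
```

As the results showed, a high score on a metric like this doesn't by itself imply wrong answers; repetition and correctness turned out to be largely independent in our testing.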

Due to limitations in available machine power, and the large amount of context used (30k context for all testing), I was only able to test these models with Q4_K_S static quants and a single attempt at each problem, and it still took very long to finish. It would have been better to test at higher precision (at least Q8_0) and with more attempts per problem (at least 3-5). If anyone with the means is willing to run their own tests under those better circumstances, I hope they share their findings with the community; and if anyone with GPU power wants to sponsor my efforts and let me rerun these tests under better conditions, I would be more than happy to. Just reach out to me here or on Discord (mim7).

EDIT - @none_user has done exactly this, testing at FP16 and with 3 attempts per problem versus my 1. Both SLERP merges tested very well, performing much better than their parents, with the Q3T merge (using the Qwen tokenizer) the best of the bunch overall. These SLERP merges turned out much better than I expected; hopefully people will start using them as the base for their future finetunes.

The Other Model

This Q3T merge uses the Qwen tokenizer (which, for now and pending further testing, seems to be the better one). You can find the DST merge, which uses the DeepSeek tokenizer, here: https://huggingface.co/lemon07r/Qwen3-R1-SLERP-DST-8B

Results and Raw Data Repository

https://github.com/lemon07r/LocalAIME_results

Eisenstein's LocalAIME Fork

https://github.com/jabberjabberjabber/LocalAIME_Kobo

(This fork is tweaked to work better with koboldcpp, and qwen/deepseek models)

LocalAIME Results

EDIT - @none_user has run the same test at a higher precision and with more attempts per problem here: https://huggingface.co/lemon07r/Qwen3-R1-SLERP-Q3T-8B/discussions/2 This is much higher quality data and testing. Big thanks to him for doing this.

Performance vs tokens generated

(figure: performance)

(figure: heatmap)

A Caveat

Since this came up in some discussion, I thought I should note that this method isn't really a great way to test tokenizers against each other, since the DeepSeek part of the two merges was still trained using the DeepSeek tokenizer, and the Qwen part with its own tokenizer. You would have to train two versions from the ground up, using the different tokenizers on the exact same data, to get a completely fair assessment. I still think this testing, and further testing, is worth doing to see how these merges perform compared to their parents, and under which tokenizer they perform better. EDIT - It turns out both tokenizers have almost complete vocab overlap and should be almost completely interchangeable with each other, so the above caveat isn't very relevant.

Merge Method

This model was merged using the SLERP merge method.

Models Merged

The following models were included in the merge:

  • Qwen/Qwen3-8B
  • deepseek-ai/DeepSeek-R1-0528-Qwen3-8B

Configuration

The following YAML configuration was used to produce this model:

slices:
  - sources:
      - model: Qwen/Qwen3-8B
        layer_range: [0, 36]
      - model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
        layer_range: [0, 36]
merge_method: slerp
base_model: Qwen/Qwen3-8B
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16
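The t parameters above control how much of each tensor comes from each parent (t=0 keeps the base model's weights, t=1 the other parent's), with the listed values interpolated across layer depth separately for the self_attn and mlp filters. The core per-tensor operation can be sketched as follows; this is a simplified illustration of spherical linear interpolation on flattened weight vectors, not mergekit's actual implementation:

```python
import math

def slerp(t: float, v0: list, v1: list, eps: float = 1e-8) -> list:
    """Spherical linear interpolation between two weight vectors."""
    n0 = math.sqrt(sum(x * x for x in v0)) + eps
    n1 = math.sqrt(sum(x * x for x in v1)) + eps
    # Angle between the (normalized) vectors.
    dot = sum((a / n0) * (b / n1) for a, b in zip(v0, v1))
    theta = math.acos(max(-1.0, min(1.0, dot)))
    if theta < eps:  # nearly parallel: fall back to plain linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s = math.sin(theta)
    w0, w1 = math.sin((1 - t) * theta) / s, math.sin(t * theta) / s
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]

# t=0 returns the first vector, t=1 the second, t=0.5 the midpoint of the arc.
mid = slerp(0.5, [1.0, 0.0], [0.0, 1.0])  # ~[0.7071, 0.7071]
```

Unlike plain averaging, SLERP follows the arc between the two weight directions, which preserves the norm structure better, one common argument for why SLERP merges tend to hold up well.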