
Distillation Requests

#1
by Liontix - opened
Owner

If you want a distilled version of a commercial/closed-source model that is available through OpenRouter, just drop a comment here. Please also mention whether you want just the dataset, or which open-source model I should use as the base model for creating the distilled model, e.g., Qwen 3. (If you prefer a certain parameter size, mention that too.)

I have a question regarding distillation. In general, how much data is considered appropriate for distilling a model? I noticed that some of the datasets you used may contain only around 100 samples, whereas in the technical report released by DeepSeek, they seem to have used about 800k.

Owner

Using more samples in the distillation dataset is better in terms of retraining the model, because you can adapt/reshape it more thoroughly. The fine-tuning method used for distillation is based on reinforcement learning (reward-based learning): you feed in a prompt, evaluate the result, and then some of the model's tensors are adjusted to produce a more fitting answer. That process is repeated for a set number of steps, which also depends on the size of the dataset. If you take a dataset with 100K+ samples, the model will perform a lot better than with these tiny datasets. With my models that use small datasets, I only try to get the base model to mimic the responding style of the LLM the dataset originates from, so I only retrain a small part of the model to imitate how the closed-source model thinks/responds.

TL;DR
Datasets with <10K samples will only let you mimic the responding style, while datasets with 100K+ samples change the base model on a deeper level, creating a better model. The relation between size and resulting quality is not linear; you also need to consider the quality of the dataset (e.g., covering different topics, logic problems, coding tasks, ...).
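
As a rough illustration of the "retrain only a small part of the model" idea, here is a minimal sketch using LoRA-based supervised fine-tuning with TRL and PEFT (a common alternative to the reward-based loop described above, not my exact pipeline); the base model name, dataset file, and hyperparameters are just placeholder assumptions.

```python
# Minimal sketch (not the exact pipeline used here): LoRA-based supervised
# fine-tuning of a Qwen 3 base model on a small prompt/completion dataset
# sampled from a teacher model. All names and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Small dataset of {"prompt": ..., "completion": ...} pairs from the teacher model
dataset = load_dataset("json", data_files="teacher_samples.jsonl", split="train")

# LoRA adapters only update a small fraction of the weights,
# which matches the "retrain a small part of the model" idea above.
peft_config = LoraConfig(
    r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM"
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",          # base model (assumption)
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="qwen3-distill-lora",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        learning_rate=2e-4,
    ),
)
trainer.train()
```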

Hey @Liontix ,

I’d be really interested in seeing a distillation of Claude Opus 4, but specifically with a dataset focused on complex coding tasks (algo problems, debugging, multi-step reasoning, etc.). Ideally, I’m thinking something like 1k+ samples minimum, but the more the better—especially if it leans towards the 100k+ range you mentioned where the model actually reshapes itself beyond just mimicking style.

Do you think putting together a dataset like that is doable? And if so, what base model would you consider best for coding-heavy distillation—Qwen 3 or something else?

Thanks for your reply.

I roughly understand what you mean, so may I ask whether you have considered using a somewhat larger dataset (e.g., 1K+ samples) for distillation? The dataset size itself may not be the key point; rather, the focus could be on ensuring that the smaller models, after distillation, not only imitate the output behaviors of stronger models such as Grok, Gemini, or Claude, but also achieve improvements on specific tasks (e.g., QA, AIME, etc.). At the same time, have you considered evaluating the performance of these distilled models and the potential effects of different datasets? I believe this would be very meaningful. If you are considering this direction, it would be fantastic if you could release these models and datasets, along with details on how the data was generated and the training procedures of the models (this would make for an excellent blog). Thank you very much!

Owner

Hello @efgry ,

I would also like to build a larger distillation dataset of Claude Opus 4, but creating it via the API (I am using OpenRouter) at $15/M input tokens and $75/M output tokens would cost me a lot of money. A demo query requesting a simple Python script cost me $0.21 (154 prompt tokens, 2810 completion tokens, including just 132 reasoning tokens). Prompting a thousand complex queries would cost approximately $400-800.
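
For reference, a quick back-of-the-envelope calculation based on the demo query's token counts and the quoted OpenRouter pricing; the per-sample averages used for the 1k extrapolation are rough assumptions:

```python
# Rough cost estimate for generating a distillation dataset via the OpenRouter API.
# Pricing and demo token counts are taken from the numbers quoted above.

INPUT_PRICE_PER_M = 15.0    # $ per 1M prompt tokens (Claude Opus 4 via OpenRouter)
OUTPUT_PRICE_PER_M = 75.0   # $ per 1M completion tokens

def query_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in dollars for a single API call."""
    return (prompt_tokens / 1e6) * INPUT_PRICE_PER_M + \
           (completion_tokens / 1e6) * OUTPUT_PRICE_PER_M

# Demo query: 154 prompt tokens, 2810 completion tokens -> ~$0.21
print(f"demo query: ${query_cost(154, 2810):.2f}")

# Extrapolating to 1,000 complex queries, assuming (hypothetically) longer
# averages of ~500 prompt and ~6000 completion tokens per sample:
print(f"1k samples: ${1000 * query_cost(500, 6000):.0f}")
```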

Owner

Hello @Nick-Awesome ,

I will look into creating a larger dataset of possibly Gemini 2.5 Pro or Flash and fine-tuning a model on it. After running some benchmarks on the models trained on the smaller and the larger datasets, we could get better insight into how that actually affects the model. I would also write a blog post covering the whole process and results. If you have specific benchmarks in mind that I should test, please let me know.
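
If it helps, something like EleutherAI's lm-evaluation-harness could be used to compare the two distills side by side; the repo ids and task list below are hypothetical placeholders, not the final benchmark plan:

```python
# Sketch: evaluating a small-dataset distill vs. a large-dataset distill with
# lm-evaluation-harness (pip install lm-eval). Repo ids and tasks are assumptions.
import lm_eval

MODELS = {
    "small-distill": "Liontix/qwen3-small-distill",   # hypothetical repo ids
    "large-distill": "Liontix/qwen3-large-distill",
}
TASKS = ["gsm8k", "mmlu", "arc_challenge"]

for name, repo in MODELS.items():
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={repo}",
        tasks=TASKS,
        batch_size=8,
    )
    # results["results"] maps each task name to its metrics
    print(name, {task: results["results"][task] for task in TASKS})
```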

Thank you very much for considering my suggestion. Conducting a systematic analysis of the capability improvements achieved through model distillation would be an excellent piece of work, as the community still lacks clarity on what constitutes effective distillation. I hope your work can provide valuable insights to everyone, and I am really looking forward to the release of your blog.

Good day, thank you for your reply.
I recently used Qwen/Qwen2.5-VL-7B-Instruct for my project on handwritten text recognition. I don't know if an equivalent Qwen 3 model would be compatible as well.
Size-wise, the raw 2.5-VL 7B is a bit large.
Thanks

We are working to get some Qwen/Qwen3-VL-8B-Thinking distills. Just trying to figure out whether or not to finetune vision as well. Any datasets in mind?

Yes, concerning the handwriting data: the IAM full-pages dataset. But I did the line and word segmentation when starting the project, even though I only used the line level.
So there are three variants:
Full page
Line
Word

Two new free stealth models on OpenRouter: Sherlock Think Alpha and Sherlock Dash Alpha. Believed to be Grok, but not confirmed.

Hello @tikeape ,

Thank you for the ping. 1000-example distill datasets for both models are queued.

Hi @Liontix
No offense, but I am getting a bit impatient.

Hi @Liontix
I cannot find a suitable model; can you make one? I need a 256k context window for input and output (or better, the most you can), and can you make it a private LLM so only I can access it?

Hello @CodeFlame . Nothing you have said has made sense.

Can anyone assist with a GGUF of legal-bert? Thanks.
