Was able to use previous Mistral chat templates, some hints from Qwen templates, and Claude to piece together a seemingly working chat template. Tested it with the llama.cpp server and got perfect results, though LM Studio still seems to be struggling for some reason (I don't know how to specify a Jinja file there).
Outlined the details of the script and results in my llama.cpp PR to add the Jinja template:
and it should be perfect! Hoping it'll work for ALL tools, not just llama.cpp, if LM Studio gets an update or something. Very happy to see it works flawlessly in llama.cpp.
In the meantime, I'll try to open a PR to minja to make strftime work, but no promises :)
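For anyone who wants to try the template with the llama.cpp server, it can load a custom Jinja chat template from disk. A minimal invocation sketch (the model and template paths below are placeholders for your own files):

```shell
# Serve a model with a custom Jinja chat template.
# --jinja enables Jinja-based template rendering;
# --chat-template-file points at the template on disk.
llama-server -m ./model.gguf \
  --jinja \
  --chat-template-file ./chat-template.jinja
```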
DeepThink Plugin: Bringing Gemini 2.5's Parallel Reasoning to Open Models
Just released an open-source plugin that implements Google's "Deep Think" reasoning approach for models like DeepSeek R1, Qwen3, and other open models.
Google's recent Gemini 2.5 report introduced Deep Think - a technique where models generate multiple hypotheses in parallel and critique them before arriving at final answers. It achieves SOTA results on math olympiads and competitive coding benchmarks.
Our implementation works by modifying the inference pipeline to explore multiple solution paths simultaneously, then synthesizing the best approach. Instead of single-pass generation, models run an internal debate before responding.
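The pipeline described above can be sketched in a few lines. This is a simplified illustration, not the plugin's actual code: `generate_fn` and `critique_fn` are hypothetical stand-ins for your model's sampling and self-evaluation calls, and the toy functions at the bottom exist only so the sketch runs end to end.

```python
from concurrent.futures import ThreadPoolExecutor

def deep_think(generate_fn, critique_fn, prompt, n_paths=3):
    """Parallel-hypothesis sketch: sample several candidate
    answers concurrently, score each with a critique pass,
    and return the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        candidates = list(pool.map(
            lambda i: generate_fn(prompt, seed=i), range(n_paths)))
    # Internal "debate": critique every candidate, keep the best.
    scored = [(critique_fn(prompt, c), c) for c in candidates]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return scored[0][1]

# Toy stand-ins so the sketch is runnable without a model.
def toy_generate(prompt, seed=0):
    return f"answer-{seed}"

def toy_critique(prompt, candidate):
    # Pretend the critic prefers higher-numbered candidates.
    return int(candidate.split("-")[1])

print(deep_think(toy_generate, toy_critique, "2+2?"))
```

In a real deployment the synthesis step could also merge candidates rather than pick one, at the cost of an extra generation pass.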
Key features:
- Works with any model that supports structured reasoning patterns
- Implements parallel thinking during response generation
- Particularly effective for complex reasoning tasks, math, and coding problems
- Increases inference time but significantly improves answer quality
The plugin won the Cerebras & OpenRouter Qwen 3 Hackathon, validating that this approach works well beyond Google's proprietary implementation.
The goal is democratizing advanced reasoning capabilities that were previously locked behind APIs. Perfect for researchers and practitioners working with local deployments who want enhanced reasoning without dependency on proprietary services.
Performance notes: Currently about 2-3x slower inference but much better results on complex problems. Working on adaptive triggering to only activate when problems benefit from parallel reasoning.
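One possible shape for that adaptive trigger is a cheap prompt-side heuristic that gates the expensive parallel pass. The keyword list and length threshold below are purely illustrative assumptions, not the plugin's actual logic:

```python
import re

# Hypothetical gate: only pay the 2-3x parallel-reasoning cost
# when the prompt looks like a hard math/coding problem.
HARD_PATTERNS = re.compile(
    r"\b(prove|derive|optimi[sz]e|algorithm|integral|complexity)\b",
    re.IGNORECASE)

def should_deep_think(prompt: str, min_len: int = 200) -> bool:
    """Return True if the prompt likely benefits from parallel reasoning."""
    return bool(HARD_PATTERNS.search(prompt)) or len(prompt) >= min_len

print(should_deep_think("What's the capital of France?"))
print(should_deep_think("Prove that sqrt(2) is irrational."))
```

A learned router (a small classifier over prompt embeddings) would be a natural next step beyond keyword matching.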
Would love feedback from the HF community and collaborations on optimizing the approach further. Open to PRs and always interested in making open models more capable.
🚀 We are delighted to announce MamayLM, a new state-of-the-art efficient Ukrainian LLM!
📈 MamayLM surpasses similar-sized models in both English and Ukrainian, while matching or overtaking up to 10x larger models.
📊 MamayLM is a 9B model that can run on a single GPU, enabling cost-efficient AI autonomy and adoption across sectors in Ukraine such as education, legal, healthcare, public services and others (e.g., by specializing it to particular use cases). MamayLM is also attractive for organizations wishing to preserve data privacy, as its efficiency allows it to run on a local machine.
🧠 MamayLM is trained on high-quality Ukrainian data and understands Ukrainian language, culture, and history. It is built on top of Google’s Gemma 2 9B model, but uses a number of new advances stemming from INSAIT’s experience in creating BgGPT, a Bulgarian LLM we released last year, now adopted nationwide and profiled several times by Google as a worldwide success case.
🤝 MamayLM is developed in a collaboration between researchers at INSAIT and ETH Zürich and is trained entirely via donations to INSAIT for AI compute resources.
📥 MamayLM is now freely available to download on INSAIT’s HuggingFace in both full and quantized versions. We also publicly release all Ukrainian benchmarks we evaluated on.
📝 Further, we release blog posts in both English and Ukrainian, sharing our approach to creating MamayLM, hoping to drive further improvements by the community.
🌎 The release of LLMs for various languages is part of INSAIT’s mission of ensuring countries can achieve AI autonomy in a cost-efficient, controlled, safe and predictable manner.
I updated the LLM Scientist roadmap and added a ton of new information and references. It covers training, datasets, evaluation, quantization, and new trends like test-time compute scaling.
The LLM Course has been incredibly popular (41.3k stars!) and I've been touched to receive many, many messages about how it helped people in their careers.
I know how difficult this stuff can be, so I'm super proud of the impact it had. I want to keep updating it in 2025, especially with the LLM Engineer roadmap.