Was able to use previous Mistral chat templates, some hints from Qwen templates, and Claude to piece together a seemingly working chat template. Tested it with the llama.cpp server and got perfect results, though LM Studio still seems to be struggling for some reason (I don't know how to specify a Jinja file there).
Outlined the details of the script and results in my llama.cpp PR to add the Jinja template:
and it should be perfect! Hoping it'll work for ALL tools, not just llama.cpp, if LM Studio gets an update or something. Very happy to see it works flawlessly in llama.cpp.
In the meantime, I'll try to open a PR to minja to make strftime work, but no promises :)
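For anyone who wants to try the template with the llama.cpp server, it can load a custom Jinja chat template from disk. A minimal invocation sketch (the model and template paths below are placeholders for your own files):

```shell
# Serve a model with a custom Jinja chat template.
# --jinja enables Jinja-based template rendering;
# --chat-template-file points at the template on disk.
llama-server -m ./model.gguf \
  --jinja \
  --chat-template-file ./chat-template.jinja
```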
DeepThink Plugin: Bringing Gemini 2.5's Parallel Reasoning to Open Models
Just released an open-source plugin that implements Google's "Deep Think" reasoning approach for models like DeepSeek R1, Qwen3, and other open models.
Google's recent Gemini 2.5 report introduced Deep Think - a technique where models generate multiple hypotheses in parallel and critique them before arriving at final answers. It achieves SOTA results on math olympiads and competitive coding benchmarks.
Our implementation works by modifying the inference pipeline to explore multiple solution paths simultaneously, then synthesizing the best approach. Instead of single-pass generation, models run an internal debate before responding.
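The pipeline described above can be sketched in a few lines. This is a simplified illustration, not the plugin's actual code: `generate_fn` and `critique_fn` are hypothetical stand-ins for your model's sampling and self-evaluation calls, and the toy functions at the bottom exist only so the sketch runs end to end.

```python
from concurrent.futures import ThreadPoolExecutor

def deep_think(generate_fn, critique_fn, prompt, n_paths=3):
    """Parallel-hypothesis sketch: sample several candidate
    answers concurrently, score each with a critique pass,
    and return the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        candidates = list(pool.map(
            lambda i: generate_fn(prompt, seed=i), range(n_paths)))
    # Internal "debate": critique every candidate, keep the best.
    scored = [(critique_fn(prompt, c), c) for c in candidates]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return scored[0][1]

# Toy stand-ins so the sketch is runnable without a model.
def toy_generate(prompt, seed=0):
    return f"answer-{seed}"

def toy_critique(prompt, candidate):
    # Pretend the critic prefers higher-numbered candidates.
    return int(candidate.split("-")[1])

print(deep_think(toy_generate, toy_critique, "2+2?"))
```

In a real deployment the synthesis step could also merge candidates rather than pick one, at the cost of an extra generation pass.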
Key features:
- Works with any model that supports structured reasoning patterns
- Implements parallel thinking during response generation
- Particularly effective for complex reasoning tasks, math, and coding problems
- Increases inference time but significantly improves answer quality
The plugin won the Cerebras & OpenRouter Qwen 3 Hackathon, validating that this approach works well beyond Google's proprietary implementation.
The goal is democratizing advanced reasoning capabilities that were previously locked behind APIs. Perfect for researchers and practitioners working with local deployments who want enhanced reasoning without dependency on proprietary services.
Performance notes: Currently about 2-3x slower inference but much better results on complex problems. Working on adaptive triggering to only activate when problems benefit from parallel reasoning.
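One possible shape for that adaptive trigger is a cheap prompt-side heuristic that gates the expensive parallel pass. The keyword list and length threshold below are purely illustrative assumptions, not the plugin's actual logic:

```python
import re

# Hypothetical gate: only pay the 2-3x parallel-reasoning cost
# when the prompt looks like a hard math/coding problem.
HARD_PATTERNS = re.compile(
    r"\b(prove|derive|optimi[sz]e|algorithm|integral|complexity)\b",
    re.IGNORECASE)

def should_deep_think(prompt: str, min_len: int = 200) -> bool:
    """Return True if the prompt likely benefits from parallel reasoning."""
    return bool(HARD_PATTERNS.search(prompt)) or len(prompt) >= min_len

print(should_deep_think("What's the capital of France?"))
print(should_deep_think("Prove that sqrt(2) is irrational."))
```

A learned router (a small classifier over prompt embeddings) would be a natural next step beyond keyword matching.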
Would love feedback from the HF community and collaborations on optimizing the approach further. Open to PRs and always interested in making open models more capable.
🚀 We are delighted to announce MamayLM, a new state-of-the-art efficient Ukrainian LLM!
📈 MamayLM surpasses similar-sized models in both English and Ukrainian, while matching or overtaking up to 10x larger models.
📊 MamayLM is a 9B model that can run on a single GPU, enabling cost-efficient AI autonomy and adoption across sectors in Ukraine such as education, legal, healthcare, public services and others (e.g., by specializing it to particular use cases). MamayLM is also attractive for organizations wishing to preserve data privacy, as its efficiency allows it to run on a local machine.
🧠 MamayLM is trained on high-quality Ukrainian data and understands Ukrainian language, culture, and history. It is built on top of Google’s Gemma 2 9B model, but uses a number of new advances stemming from INSAIT’s experience in creating BgGPT, a Bulgarian LLM we released last year, now adopted nationwide and profiled several times by Google as a worldwide success case.
🤝 MamayLM is developed in a collaboration between researchers at INSAIT and ETH Zürich and is trained entirely via donations to INSAIT for AI compute resources.
📥 MamayLM is now freely available to download on INSAIT’s HuggingFace in both full and quantized versions. We also publicly release all Ukrainian benchmarks we evaluated on.
📝 Further, we release blog posts in both English and Ukrainian, sharing our approach to creating MamayLM, hoping to drive further improvements by the community.
🌎 The release of LLMs for various languages is part of INSAIT’s mission of ensuring countries can achieve AI autonomy in a cost-efficient, controlled, safe and predictable manner.
I updated the LLM Scientist roadmap and added a ton of new information and references. It covers training, datasets, evaluation, quantization, and new trends like test-time compute scaling.
The LLM Course has been incredibly popular (41.3k stars!) and I've been touched to receive many, many messages about how it helped people in their careers.
I know how difficult this stuff can be, so I'm super proud of the impact it had. I want to keep updating it in 2025, especially with the LLM Engineer roadmap.