AI & ML interests

None defined yet.

Recent Activity

plcedoz38 updated a collection about 2 months ago
Holo1
philmod-h updated a model about 2 months ago
Hcompany/Holo1-7B
philmod-h updated a model about 2 months ago
Hcompany/Holo1-3B

Articles

sergiopaniego posted an update about 15 hours ago
merve posted an update about 19 hours ago
we're all sleeping on this OCR model rednote-hilab/dots.ocr 🔥

dots.ocr is a new 3B model with sota performance, support for 100 languages & allowing commercial use! 🤯

single e2e model to extract image, convert tables, formula, and more into markdown 📝
try it MohamedRashad/Dots-OCR
sergiopaniego posted an update 1 day ago
Want to learn how to align a Vision Language Model (VLM) for reasoning using GRPO and TRL? 🌋

🧑‍🍳 We've got you covered!!

NEW multimodal post training recipe to align a VLM using TRL in @HuggingFace's Cookbook.

Go to the recipe 👉 https://huggingface.co/learn/cookbook/fine_tuning_vlm_grpo_trl

Powered by the latest TRL v0.20 release, this recipe shows how to teach Qwen2.5-VL-3B-Instruct to reason over images 🌋
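
For readers new to GRPO, here is a minimal sketch of its core idea: each sampled completion is scored relative to the other completions in its group. This is not the recipe's code or TRL's implementation. The function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Score each completion relative to its own group (hypothetical helper).

    `rewards` holds one scalar reward per completion sampled for the same prompt.
    `eps` is an assumed constant that avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two rewarded, two not.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Completions that beat the group average get a positive advantage and the rest get a negative one, so no separate value model is needed to provide a baseline.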
merve posted an update 1 day ago
massive releases and tons of FLUX.1 Krea LoRAs this past week!
here's some of the picks, find more models in collection 🫡 merve/releases-august-2-6890c14248203522b7d0267f

LLMs 💬
> Tencent dropped tencent/Hunyuan-7B-Instruct
> Qwen released Qwen/Qwen3-Coder-30B-A3B-Instruct, 30B MoE with 3B active params for coding (OS)

vision/multimodal
> RedNote released rednote-hilab/dots.ocr - 3B OCR model (OS)
> Cohere released CohereLabs/command-a-vision-07-2025 - 112B (dense!) VLM for 6 languages
> StepFun-AI shipped stepfun-ai/step3 - 321B MoE VLM (OS)
> Skywork shipped Skywork/Skywork-UniPic-1.5B - new any-to-any model (image+text → image+text) (OS)
sergiopaniego posted an update 2 days ago
Just included example scripts for aligning models using GSPO (including VLM example) 🙆‍♂️🙆‍♂️

GSPO is the latest RL alignment algo by @Alibaba_Qwen and it's already supported in the latest TRL v0.20 release.

Super-easy-to-get-started example scripts below, GO run them! 👩‍💻👩‍💻

🧑‍🎨 Script: https://github.com/huggingface/trl/blob/main/examples/scripts/gspo.py
🦄 VLM script: https://github.com/huggingface/trl/blob/main/examples/scripts/gspo_vlm.py
🧩 More TRL examples: https://huggingface.co/docs/trl/main/en/example_overview
🧙‍♂️ GSPO paper: Group Sequence Policy Optimization (2507.18071)
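
As a rough illustration of what "sequence" means in GSPO, the sketch below computes a length-normalized, sequence-level importance ratio and a clipped objective for one completion. It is a simplified reading of the paper, not the code in the linked scripts. The function names and the `eps` clipping value are hypothetical.

```python
import math


def sequence_importance_ratio(new_logps, old_logps):
    """Length-normalized sequence-level ratio (hypothetical helper).

    `new_logps` and `old_logps` are per-token log-probabilities of the same
    completion under the current and the old policy.
    """
    if not new_logps or len(new_logps) != len(old_logps):
        raise ValueError("expected two non-empty lists of equal length")
    mean_log_ratio = sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps)
    return math.exp(mean_log_ratio)


def clipped_sequence_objective(ratio, advantage, eps=0.2):
    """PPO-style clipping applied once per sequence; `eps` is an assumed value."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


ratio = sequence_importance_ratio([-0.9, -1.9], [-1.0, -2.0])
print(ratio, clipped_sequence_objective(ratio, advantage=1.0))
```

The ratio and the clipping are applied once per whole completion rather than once per token, which is the main contrast with token-level GRPO.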
merve posted an update 6 days ago
sergiopaniego posted an update 7 days ago
Did you miss this? 👓

🧙‍♂️ vLLM + transformers integration just got upgraded with direct VLM support.

Select a VLM + model_impl=transformers and play via vLLM!
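
A minimal sketch of what that selection could look like from Python, assuming vLLM is installed and the chosen checkpoint is supported by the transformers backend. The helper names and the example model id are illustrative choices, not taken from the post, and `run_demo` is defined but not called here because it needs a GPU and a model download.

```python
def build_engine_kwargs(model_id: str) -> dict:
    """Keyword arguments asking vLLM to use the transformers implementation."""
    return {"model": model_id, "model_impl": "transformers"}


def run_demo(model_id: str = "Qwen/Qwen2.5-VL-3B-Instruct") -> str:
    """Hypothetical demo; the default model id is an assumption."""
    # Imported lazily so the helper above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**build_engine_kwargs(model_id))
    params = SamplingParams(max_tokens=64)
    outputs = llm.generate(["Describe what a vision language model does."], params)
    return outputs[0].outputs[0].text


print(build_engine_kwargs("Qwen/Qwen2.5-VL-3B-Instruct"))
```

Call `run_demo()` on a machine with vLLM and a suitable GPU to generate text through the transformers-backed engine.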
merve posted an update 7 days ago
past week in open AI was insane 🔥 here's some of the picks, find more here merve/releases-july-25-688768ca47fe3693407e02d1

💬 LLMs & VLMs
> Qwen/Qwen3-235B-A22B-Thinking-2507 had a new update (OS)
> Qwen/Qwen3-Coder-480B-A35B-Instruct is out with 480B total 35B active params 🤯 (OS)
> AllenAI dropped an update to allenai/olmOCR-7B-0725 📝
> InternLM released internlm/Intern-S1 - 235B Qwen3 MoE + 6B InternViT encoder (OS)
> OmniSVG/OmniSVG is a new SVG generation VLM (OS)

🖼️ image/video/3D generation
> WanAI released Wan2.2 series - both T2V and I2V 14B models for high-quality video generation (OS) multimodalart/wan-22-688767e313337b434ed55112
> Tencent dropped tencent/HunyuanWorld-1 - image-to-3D scene generation model
sergiopaniego posted an update 8 days ago
We just released TRL v0.20 with major multimodal upgrades!

👁️ VLM support for GRPO (highly requested by the community!)
🎞️ New GSPO trainer (from @Qwen, released last week, VLM-ready)
New MPO trainer (multimodal by design, as in the paper)

📝 Full release notes here: https://github.com/huggingface/trl/releases/tag/v0.20.0
merve posted an update 9 days ago
🤯 241B VLM with apache-2.0 license internlm/Intern-S1

internlm released Intern-S1: multimodal reasoning model based on 235B MoE Qwen3 and 6B InternViT

benchmarks look great (👑 best model ✅ best open model)
sergiopaniego posted an update 14 days ago
Yet Another New Multimodal Fine-Tuning Recipe 🥧

🧑‍🍳 In this @HuggingFace Cookbook notebook, we demonstrate how to align a multimodal model (VLM) using Mixed Preference Optimization (MPO) with TRL.

💡 This recipe is powered by the new MPO support in TRL, enabled through a recent upgrade to the DPO trainer!

We align the multimodal model using multiple optimization objectives (losses), guided by a preference dataset (chosen vs. rejected multimodal pairs).

Check it out! ➡️ https://huggingface.co/learn/cookbook/fine_tuning_vlm_mpo
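
To make "multiple optimization objectives" concrete, here is a minimal sketch that combines a preference loss and a generation loss for one chosen/rejected pair. It assumes the preference term is the standard DPO sigmoid loss and the generation term is a plain SFT negative log-likelihood. MPO's quality term is left out, and the function names, `beta`, and the weights are hypothetical rather than taken from the notebook.

```python
import math


def dpo_sigmoid_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Preference term: -log(sigmoid(margin)) on summed sequence log-probs."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))


def sft_loss(policy_chosen, num_tokens):
    """Generation term: mean negative log-likelihood of the chosen answer."""
    return -policy_chosen / num_tokens


def combine(losses, weights):
    """Weighted sum of the individual objectives; the weights are assumptions."""
    return sum(w * l for w, l in zip(weights, losses))


pref = dpo_sigmoid_loss(-4.0, -8.0, -5.0, -7.0)
gen = sft_loss(-4.0, num_tokens=4)
print(combine([pref, gen], weights=[0.8, 1.0]))
```

The preference term pushes the chosen answer above the rejected one relative to the reference model, while the generation term keeps the chosen answer likely in absolute terms.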
merve posted an update 14 days ago
so many open LLMs and image LoRAs dropped past week, here's some picks for you 🫡 merve/releases-july-18-687e3fbd2ab9b39c51f9238b

LLMs
> ByteDance released a bunch of translation models called Seed-X-RM (7B) ByteDance-Seed/Seed-X-RM-7B
> NVIDIA released reasoning models, of which the 32B surpasses the giant Qwen3-235B, with cc-by-4.0 license nvidia/openreasoning-nemotron-687730dae0170059860f1f01
> LG released a new EXAONE model (32B) LGAI-EXAONE/EXAONE-4.0-32B

VLMs/any-to-any
> vidore/colqwen-omni-v0.1 is a new any-to-any retriever (MIT)
> HiDream-ai/HiDream-E1-1 is image+text in image+text out model (MIT)

LoRAs
> There's a bunch of LoRAs based on Flux Kontext, gotta check out the collection 🤠
merve posted an update 16 days ago
sergiopaniego posted an update 19 days ago
🧑‍🍳 New Multimodal Fine-Tuning Recipe 🧑‍🍳

⚡️ In this new @huggingface Cookbook recipe, I walk you through the process of fine-tuning a Visual Language Model (VLM) for Object Detection with Visual Grounding, using TRL.

🔍 Object detection typically involves detecting categories in images (e.g., vase).

By combining it with visual grounding, we add contextual understanding, so instead of detecting just "vase", we can detect "middle vase" in an image.

VLMs are super powerful!

In this case, I use PaliGemma 2, which already supports object detection, and extend it to also add visual grounding.

🤗 Check it out here: https://huggingface.co/learn/cookbook/fine_tuning_vlm_object_detection_grounding
sergiopaniego posted an update 20 days ago
Multiple NEW notebooks and scripts added to the Hugging Face Gemma recipes repo!

Thanks to the community 🫶, we're adding more and more recipes using Gemma 💎

Fine tuning for all modalities, function calling, RAG...

Repo: https://github.com/huggingface/huggingface-gemma-recipes

We're also open to new ideas from the community 🤗!
merve posted an update 20 days ago
merve posted an update 21 days ago
Fine-tune Gemma3n on videos with audio on a Colab A100 🔥
Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!

keep in mind, it's made for educational purposes 🫡 we do LoRA, audio resampling & video downsampling to be able to train with <40GB VRAM

stretch modalities and unfreeze layers as you wish! 🙏🏻 merve/smol-vision
sergiopaniego posted an update 23 days ago
sergiopaniego posted an update 23 days ago
merve posted an update 23 days ago
past week had huuuge releases 💗
here's our picks 🔥 find more models, datasets, demos here merve/releases-july-11-68750452c358c98b0fa663f7

> moonshotai/Kimi-K2-Instruct is the new sota LLM with 1T total 32B active parameters 🤯

> HuggingFaceTB/SmolLM3-3B is the new best LM for its size, offers thinking mode 💭 as well as the dataset HuggingFaceTB/smoltalk2

> Alibaba-NLP/WebSailor-3B is the new agentic LLM for complex browsing

> Google DeepMind released medical vision LMs with an agentic doctor-patient app google/medgemma-release-680aade845f90bec6a3f60c4

> fal released a LoRA to improve details on face images fal/Realism-Detailer-Kontext-Dev-LoRA