boun-tabi-LMG (boun-tabi-LMG)

merve

posted an update about 18 hours ago

we're all sleeping on this OCR model rednote-hilab/dots.ocr 🔥

dots.ocr is a new 3B model with sota performance, support for 100 languages & allowing commercial use! 🤯

single e2e model to extract images and convert tables, formulas, and more into markdown 📝
try it MohamedRashad/Dots-OCR

merve

posted an update 1 day ago

massive releases and tons of FLUX.1 Krea LoRAs this past week!
here are some of the picks, find more models in the collection 🫡 merve/releases-august-2-6890c14248203522b7d0267f

LLMs 💬
> Tencent dropped tencent/Hunyuan-7B-Instruct
> Qwen released Qwen/Qwen3-Coder-30B-A3B-Instruct, 30B MoE with 3B active params for coding (OS)

vision/multimodal
> RedNote released rednote-hilab/dots.ocr - 3B OCR model (OS)
> Cohere released CohereLabs/command-a-vision-07-2025 - 112B (dense!) VLM for 6 languages
> StepFun-AI shipped stepfun-ai/step3 - 321B MoE VLM (OS)
> Skywork shipped Skywork/Skywork-UniPic-1.5B - new any-to-any model (image+text → image+text) (OS)

merve

posted an update 6 days ago

Cohere just dropped CohereLabs/command-a-vision-07-2025, a 112B (dense!) vision LM
> based on SigLIP2 & Command-A
> built for enterprise use cases 🔥
> use with Inference Providers or transformers 🤗
read their blog https://huggingface.co/blog/CohereLabs/introducing-command-a-vision-07-2025

merve

posted an update 7 days ago

past week in open AI was insane 🔥 here are some of the picks, find more here merve/releases-july-25-688768ca47fe3693407e02d1

💬 LLMs & VLMs
> Qwen/Qwen3-235B-A22B-Thinking-2507 had a new update (OS)
> Qwen/Qwen3-Coder-480B-A35B-Instruct is out with 480B total 35B active params 🤯 (OS)
> AllenAI dropped an update to allenai/olmOCR-7B-0725 📝
> InternLM released internlm/Intern-S1 - 235B Qwen3 MoE + 6B InternViT encoder (OS)
> OmniSVG/OmniSVG is a new SVG generation VLM (OS)
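For a feel of what "total vs. active" params means for the MoE models above, here's a tiny sketch using only the Qwen3-Coder numbers quoted in this post (480B total, 35B active). The helper name `active_fraction` is made up for illustration.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of a MoE model's parameters used per token (both counts in billions)."""
    return active_b / total_b

# Numbers quoted in the post for Qwen/Qwen3-Coder-480B-A35B-Instruct.
qwen3_coder = active_fraction(total_b=480, active_b=35)
print(f"{qwen3_coder:.1%} of parameters active per token")
```

So only a small slice of the weights is active per token, even though the full set still has to fit in memory.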

🖼️ image/video/3D generation
> WanAI released Wan2.2 series - both T2V and I2V 14B models for high-quality video generation (OS) multimodalart/wan-22-688767e313337b434ed55112
> Tencent dropped tencent/HunyuanWorld-1 - image-to-3D scene generation model

merve

posted an update 9 days ago

🤯 241B VLM with apache-2.0 license internlm/Intern-S1

internlm released Intern-S1: multimodal reasoning model based on 235B MoE Qwen3 and 6B InternViT 😍

benchmarks look great (👑 best model ✅ best open model)

merve

posted an update 14 days ago

so many open LLMs and image LoRAs dropped past week, here's some picks for you 🫡 merve/releases-july-18-687e3fbd2ab9b39c51f9238b

LLMs
> ByteDance released a bunch of translation models called Seed-X-RM (7B) ByteDance-Seed/Seed-X-RM-7B
> NVIDIA released reasoning models, of which the 32B surpasses the giant Qwen3-235B, with cc-by-4.0 license 👏 nvidia/openreasoning-nemotron-687730dae0170059860f1f01
> LG released a new EXAONE model (32B) LGAI-EXAONE/EXAONE-4.0-32B

VLMs/any-to-any
> vidore/colqwen-omni-v0.1 is a new any-to-any retriever (MIT)
> HiDream-ai/HiDream-E1-1 is image+text in image+text out model (MIT)

LoRAs
> There's a bunch of LoRAs based on Flux Kontext, gotta check out the collection 🤠

merve

posted an update 16 days ago

Now it's possible to do RAG with any-to-any models 🔥

Learn how to search a video dataset and generate using Tevatron/OmniEmbed-v0.1-multivent, an all-modality retriever, and Qwen/Qwen2.5-Omni-7B, an any-to-any model, in this notebook 🤝 merve/smol-vision
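The retrieve-then-generate flow can be sketched roughly like this. It's a toy illustration, not the notebook's code: the 3-d vectors, clip names, and the `retrieve` helper are all made up, and it assumes the retriever gives one embedding per query and per clip.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, corpus, k=2):
    """Return the ids of the k corpus items most similar to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["id"] for item in ranked[:k]]

# Toy embeddings standing in for retriever outputs (hypothetical values).
corpus = [
    {"id": "clip_cooking.mp4", "vec": [0.9, 0.1, 0.0]},
    {"id": "clip_football.mp4", "vec": [0.0, 1.0, 0.1]},
    {"id": "clip_baking.mp4", "vec": [0.8, 0.0, 0.3]},
]
query_vec = [1.0, 0.0, 0.1]

top_clips = retrieve(query_vec, corpus, k=2)
# In a real pipeline the retrieved clips go to the any-to-any model with the question.
```

In the real setup, the retriever would produce the embeddings and the top clips would be handed to the generator as context.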

merve

posted an update 20 days ago

all modality RAG 🔥

ColQwen-Omni is a new multimodal retrieval model that can retrieve anything (videos, audio, documents and more!)

use with transformers 🤗
read the blog https://huggingface.co/blog/manu/colqwen-omni-omnimodal-retrieval
model repository vidore/colqwen-omni-v0.1
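For intuition on how multi-vector retrievers can rank items, here's a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring. It assumes ColQwen-Omni scores this way like earlier ColBERT-style retrievers, which the post itself doesn't state; the vectors and the `maxsim_score` helper are made up for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query vectors, of the best dot product against any document vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy multi-vector embeddings (hypothetical values): one vector per token/patch.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8]]
doc_b = [[0.5, 0.5], [0.4, 0.4]]

scores = {"doc_a": maxsim_score(query, doc_a), "doc_b": maxsim_score(query, doc_b)}
best = max(scores, key=scores.get)
```

Each query vector only needs one good match somewhere in the document, which is why this style of scoring works on long videos or multi-page documents.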

merve

posted an update 21 days ago

Fine-tune Gemma3n on videos with audio on a Colab A100 🔥
Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!

keep in mind, it's made for educational purposes 🫡 we do LoRA, audio resampling & video downsampling to be able to train with <40GB VRAM

stretch modalities and unfreeze layers as you wish! 🙏🏻 merve/smol-vision
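To make "video downsampling" and "audio resampling" concrete, here's a rough sketch of both ideas in plain Python. This is not the notebook's code: the function names are made up, and the resampler is naive linear interpolation with no anti-aliasing filter, so treat it as a toy.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick num_samples evenly spaced frame indices from a clip with num_frames frames."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler for a mono list of floats."""
    if not samples:
        return []
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * src_rate / dst_rate      # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Keep 6 frames out of a 300-frame clip, and halve a tiny audio signal's sample rate.
indices = sample_frame_indices(num_frames=300, num_samples=6)
halved = resample_linear([0.0, 1.0, 2.0, 3.0], src_rate=4, dst_rate=2)
```

Fewer frames and a lower sample rate both mean fewer input tokens, which is where the memory savings come from.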

merve

posted an update 23 days ago

past week had huuuge releases 💗
here's our picks 🔥 find more models, datasets, demos here merve/releases-july-11-68750452c358c98b0fa663f7

> moonshotai/Kimi-K2-Instruct is the new sota LLM with 1T total 32B active parameters 🤯

> HuggingFaceTB/SmolLM3-3B is the new best LM for its size, offers thinking mode 💭 as well as the dataset HuggingFaceTB/smoltalk2

> Alibaba-NLP/WebSailor-3B is the new agentic LLM for complex browsing

> Google DeepMind released medical vision LMs with an agentic doctor-patient app google/medgemma-release-680aade845f90bec6a3f60c4

> fal released a LoRA to improve details on face images fal/Realism-Detailer-Kontext-Dev-LoRA

merve

posted an update 28 days ago

GitHub has refused to render notebooks for a long time now 💔

so smol-vision now lives in a Hugging Face model repository 🤗 merve/smol-vision

merve

posted an update 29 days ago

ByteDance released Tar 1.5B and 7B: image-text in image-text out models, fully open-source 👏 ByteDance-Seed/tar-6864cf0d9fe59a3b91cc4260

They have an image tokenizer unified with text, and they de-tokenize using either of two models (LLM and diffusion)
The model is actually a full LLM (Qwen2), the tokenizer converts images into tokens 🤯

merve

posted an update 30 days ago

Huge drops in open AI past week!
Find more models, datasets, demos here merve/releases-july-4-686bcc54ed7c45c341fbf654
Some of our picks 🫡
⏯️ BAAI/MTVCraft is a new Veo3-like text-to-video model, demo is here BAAI/MTVCraft
🧑🏻‍💻 apple/diffucoder-6868139f56672ae046fe04e8 is a new family of diffusion LLMs (7B base and instruct) for coding
🗣️ kyutai/tts-1.6b-en_fr is a new small TTS model for English and French
👀 aharley/alltracker is a new pixel tracking model by Stanford, demo is here aharley/alltracker
📖 racineai/OGC_MEGA_MultiDomain_DocRetrieval is a new large visual document retrieval dataset

merve

posted an update about 1 month ago

SOOOO MANY MODEL RELEASES 😍
Here's some picks from past week 🤗

> ByteDance/XVerse is a new identity preserving image generation model 🖼️
> google/gemma-3n-E4B-it, any-to-text model supported by transformers 🤗
> nvidia/llama-nemoretriever-colembed-3b-v1 and nvidia/llama-nemoretriever-colembed-1b-v1, two new state-of-the-art visual document retrievers 📑
> New version of Dia TTS model is up nari-labs/Dia-1.6B-0626
> Black Forest Labs releases Kontext benchmark black-forest-labs/kontext-bench

Find more here merve/releases-june-27-6864e8eb17f7e3a8b444083c

merve

posted an update about 1 month ago

visual reasoning is now in transformers 🔥
https://huggingface.co/THUDM/GLM-4.1V-9B-Thinking is just released and merged into transformers, we gave it a vibe test run 🤠

it's very good, comes with 64k context length and MIT license 😍
it supports 4k image tokens and any aspect ratio as well!
Notebook: http://colab.research.google.com/drive/1atODIiV57hOZLv16Bjzwd6fwx0yoTorj?usp=sharing
Demo: https://huggingface.co/spaces/THUDM/GLM-4.1V-9B-Thinking-Demo

merve

posted an update about 1 month ago

so many multimodal releases these days 🤠
> ERNIE-4.5-VL: new vision language MoE models by Baidu https://huggingface.co/models?search=ernie-4.5-vl
> new visual document retrievers by NVIDIA (sota on ViDoRe!) nvidia/llama-nemoretriever-colembed-3b-v1 nvidia/llama-nemoretriever-colembed-1b-v1
> Ovis-3b: new image-text in image-text out models by Alibaba ⤵️ https://huggingface.co/spaces/AIDC-AI/Ovis-U1-

merve

posted an update about 1 month ago

Dataset Viewer for PDFs just landed on Hugging Face 📖🤗 you can now preview all the PDFs more easily than before!

on top of this, there's PdfFolder format to load the PDF datasets quicker 💨
> to use it, your dataset should follow a directory format like folder/train/doc1.pdf, folder/train/doc2.pdf
> if you want to include bounding boxes, labels etc. you can keep them in a metadata.csv file in the same folder 🤝
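Here's a minimal sketch of what that layout could look like on disk, built with the standard library only. The `build_pdf_folder` helper, the placeholder PDF bytes, and the `label` column are made up for illustration. The `file_name` column and the `load_dataset("pdffolder", ...)` call in the trailing comment are assumptions based on how other folder-based loaders in `datasets` work, so check the linked docs before relying on them.

```python
import csv
import tempfile
from pathlib import Path

def build_pdf_folder(root, labels):
    """Create root/train/<name>.pdf placeholders plus a metadata.csv next to them."""
    train_dir = Path(root) / "train"
    train_dir.mkdir(parents=True, exist_ok=True)
    for name in labels:
        # Placeholder bytes only; a real dataset holds actual PDF files here.
        (train_dir / name).write_bytes(b"%PDF-1.4\n")
    with open(train_dir / "metadata.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "label"])  # column names are an assumption
        for name, label in labels.items():
            writer.writerow([name, label])
    return train_dir

root = tempfile.mkdtemp()
train_dir = build_pdf_folder(root, {"doc1.pdf": "invoice", "doc2.pdf": "resume"})
created = sorted(p.name for p in train_dir.iterdir())

# With a recent `datasets` release, loading would presumably look like:
# from datasets import load_dataset
# ds = load_dataset("pdffolder", data_dir=root)
```

Keeping the metadata file next to the PDFs means each row can point at its document by file name.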

read document dataset docs https://huggingface.co/docs/datasets/main/en/document_dataset
check all the document datasets here https://huggingface.co/datasets?modality=modality:document&sort=trending 📖

merve

posted an update about 1 month ago

we've merged LightGlue keypoint matcher to Hugging Face transformers! it allows commercial use when paired with an open-source keypoint detector 🙏🏻

it works very well, try it yourself: ETH-CVG/LightGlue

here's an in-the-wild test with two images of the same place ⤵️

merve

posted an update about 1 month ago

Release picks of the past week are here! Find more models, datasets, Spaces here merve/june-20-releases-68594824d1f4dfa61aee3433

🖼️ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM, 3B active params, smarter with fewer tokens, supports long documents, videos 👏 (OS)
> nanonets/Nanonets-OCR-s is a 3.75B-param OCR model based on Qwen2.5VL-3B-Instruct (OS)

💬 LLMs
> moonshotai/Kimi-Dev-72B is a strong coding model based on Qwen2.5-72B (OS)
> Mistral released mistralai/Mistral-Small-3.2-24B-Instruct-2506, an update to their former model with better function calling & instruction following (OS)

🗣️ Audio
> Google released google/magenta-realtime, real time music generation & audio synthesis (cc-by-4)
> kyutai released new speech-to-text models that come in 1B & 2B (kyutai/stt-1b-en_fr, stt-2b-en_fr) with 0.5s and 2.5s delay

3D
> Tencent released tencent/Hunyuan3D-2.1 an image-to-3D model (see below)

merve

posted an update about 1 month ago

fav open-source multimodal reasoning model just got an update 🔥

moonshotai/Kimi-VL-A3B-Thinking-2506 has
> smarter with less tokens, small size (only 3B active params!!!)
> better accuracy
> video reasoning
> higher resolution 🤯
Read their blog https://huggingface.co/blog/moonshotai/kimi-vl-a3b-thinking-2506
