🖼️ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM with 3B active params, smarter with fewer tokens, and supports long documents and videos 👏 (OS)
> nanonets/Nanonets-OCR-s is a 3.75B-param OCR model based on Qwen2.5VL-3B-Instruct (OS)
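To try Nanonets-OCR-s locally, here's a minimal sketch using transformers' generic image-text-to-text interface (the prompt and file name are placeholders; check the model card for the recommended OCR prompt):

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "nanonets/Nanonets-OCR-s"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("scan.png")  # placeholder input document
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract the text of this document as markdown."},  # placeholder prompt
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, not the prompt.
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```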
🗣️ Audio
> Google released google/magenta-realtime for real-time music generation & audio synthesis (CC-BY-4.0)
> kyutai released new speech-to-text models in 1B & 2B sizes (kyutai/stt-1b-en_fr, kyutai/stt-2b-en_fr) with 0.5s and 2.5s delay respectively
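Quickest way to smoke-test the kyutai checkpoints, with one loud assumption: I'm assuming they plug into transformers' generic ASR pipeline; if the checkpoint format isn't supported there, use the loader the model card documents (kyutai ships its own tooling):

```python
from transformers import pipeline

# Assumption: this checkpoint works with the generic ASR pipeline.
# If loading fails, follow the model card's own instructions instead.
asr = pipeline("automatic-speech-recognition", model="kyutai/stt-1b-en_fr")

result = asr("sample.wav")  # placeholder audio file
print(result["text"])
```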
✨ 32B - Apache 2.0
✨ 38.0% pass@1 on SWE-bench Verified
✨ Up to 47.0% with test-time scaling (see the sketch below)
✨ Shows a clear data scaling law (8K+ demos)
✨ Built on Qwen2.5-Coder-32B + OpenHands
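For context on the test-time-scaling line: the usual recipe is best-of-n, i.e., sample several candidate patches per issue and keep one that a verifier accepts. A minimal, generic sketch of that idea (the sampler and verifier here are hypothetical stand-ins, not this model's actual pipeline):

```python
from typing import Callable, Optional

def best_of_n(
    issue: str,
    generate_patch: Callable[[str], str],  # hypothetical LM sampler: issue -> candidate patch
    passes_tests: Callable[[str], bool],   # hypothetical verifier: does the test suite pass?
    n: int = 8,
) -> Optional[str]:
    """Sample up to n candidate patches; return the first one the verifier accepts."""
    for _ in range(n):
        patch = generate_patch(issue)
        if passes_tests(patch):
            return patch
    return None  # all n samples failed; equivalent to a pass@1 miss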
y'all have been asking my opinion on how OCR models compare to each other 👀 instead, I'll leave you three apps by @prithivMLmods to compare the newest models ⤵️
> compare Nanonets-OCR-s, Qwen2-VL-OCR-2B-Instruct, RolmOCR, Aya-Vision: prithivMLmods/Multimodal-OCR
> compare SmolDocling, Nanonets-OCR-s, MonkeyOCR, Typhoon-OCR-7B: prithivMLmods/Multimodal-OCR2
> compare docscopeOCR, MonkeyOCR, coreOCR: prithivMLmods/core-OCR
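If you'd rather script the comparison than click through the UIs, gradio_client can drive a Space programmatically. A sketch under one assumption: I haven't checked these Spaces' endpoint names or signatures, so call view_api() first and adjust:

```python
from gradio_client import Client

# Point at one of the comparison Spaces.
client = Client("prithivMLmods/Multimodal-OCR")

# Print the Space's real endpoints and argument names;
# the predict() call below is a placeholder until you check this.
client.view_api()

# Hypothetical call shape: replace api_name/args with what view_api() reports.
# result = client.predict("page.png", api_name="/predict")
# print(result)
```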
Every language carries its own cultural values and worldviews. So, when we build AI systems, we're not just deciding how they speak but also whose perspectives they represent.
Even choosing which dialect to train on in Norway becomes a question of inclusion and power. In Kenya, will AI speak Swahili from Nairobi or coastal regions? What about indigenous languages with rich oral traditions but limited written text, like Quechua in Peru or Cherokee in North America?
The path forward? Building WITH communities, not just FOR them. Working with local partners (libraries, universities, civil society), testing for cultural alignment, and asking hard questions about representation.
so far I figured out:
> for fact-checking, you need a relatively larger model (7B is ok!)
> Gemma 3 degrades without pan & scan (especially for 📑); see the sketch below
> Qwen2.5VL-32B is very talkative: great for reasoning, but not great for simple tasks 🗣️
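On pan & scan: in transformers, Gemma 3's processor can split large or non-square images into extra zoomed-in crops before encoding. A minimal sketch of toggling it (the model id, prompt, and file name are placeholders; the do_pan_and_scan flag is per the transformers Gemma 3 docs):

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/gemma-3-4b-it"  # placeholder: any Gemma 3 vision checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("document_page.png")  # placeholder: a dense document scan
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What is the invoice total?"},  # placeholder prompt
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# do_pan_and_scan=True adds zoomed-in crops alongside the global view,
# which is what rescues quality on dense documents.
inputs = processor(
    text=prompt, images=[image], do_pan_and_scan=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```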