Many VLMs claim to process hours of video. But can they follow the story? 🤔 Today, we introduce TimeScope: the benchmark that separates true temporal understanding from marketing hype. Let's see how much VLMs really understand! ⏳
The results are in, and they're revealing. Only Gemini 2.5 Pro handles 1-hour-long videos. Performance drops sharply with duration, showing that long-video understanding is still challenging. We've found the breaking points; now the community can start fixing them.
Want to learn more? TimeScope is 100% open-source. Benchmark your model and help us build the next generation of video AI.
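If you want to try it on your own model, a minimal evaluation loop could look like the sketch below. The repo id, column names, and `my_vlm_answer` helper are all hypothetical placeholders for illustration, not TimeScope's actual API; check the benchmark repo for the real ones.

```python
from datasets import load_dataset


def my_vlm_answer(video, question: str) -> str:
    """Placeholder: run your VLM on the clip + question, return its answer."""
    raise NotImplementedError


# hypothetical repo id and splits -- see the TimeScope repo for the real ones
ds = load_dataset("your-org/TimeScope", split="test")

correct = 0
for sample in ds:
    # assumed fields: a video reference, a question, and a gold answer
    pred = my_vlm_answer(sample["video"], sample["question"])
    correct += int(pred.strip().lower() == sample["answer"].strip().lower())

print(f"accuracy: {correct / len(ds):.2%}")
```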
Fine-tune Gemma3n on videos that include audio, on a Colab A100 🔥 Just dropped a notebook where you can learn how to fine-tune Gemma3n on images + audio + text at the same time!
keep in mind, it's made for educational purposes 🫡 we do LoRA, audio resampling & video downsampling to keep training under 40 GB of VRAM. stretch modalities and unfreeze layers as you wish! 👇🏻 merve/smol-vision
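The LoRA part of that recipe, sketched in isolation: this assumes the Gemma3n classes in a recent transformers release, and the checkpoint, target modules, and ranks below are illustrative choices, not the notebook's exact config.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

model_id = "google/gemma-3n-E2B-it"  # assumed checkpoint, pick yours
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3nForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA on the attention projections only; illustrative hyperparameters
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a tiny fraction of weights train

# in preprocessing, audio gets resampled to the processor's expected rate and
# videos get downsampled to a handful of frames -- that, plus LoRA, is what
# keeps the whole run under ~40 GB of VRAM
```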
They have an image tokenizer unified with text, and they de-tokenize using either of two models (an LLM or a diffusion model). The model is actually a full LLM (Qwen2); the tokenizer converts images into discrete tokens 🤯
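The "one vocabulary for everything" trick is easy to picture with a toy example; the vocabulary sizes and helper names below are made up for illustration, not the model's real code.

```python
TEXT_VOCAB = 151_936    # e.g. a Qwen2-style text vocabulary size
IMAGE_CODEBOOK = 8_192  # discrete codes produced by the image tokenizer


def image_code_to_token_id(code: int) -> int:
    # image codes live in the same id space, right after the text ids,
    # so the LLM just models one flat sequence of ints
    return TEXT_VOCAB + code


def token_id_to_image_code(token_id: int) -> int:
    assert token_id >= TEXT_VOCAB, "not an image token"
    return token_id - TEXT_VOCAB


# a mixed text+image sequence the LLM can generate autoregressively;
# a de-tokenizer (LLM- or diffusion-based) turns the image codes back into pixels
sequence = [101, 2009, 318] + [image_code_to_token_id(c) for c in (7, 4090, 55)]
print(sequence)  # [101, 2009, 318, 151943, 156026, 151991]
```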
Humans often solve visual problems by sketching ideas in their minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal "mental sketches"?
That's the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.
These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.
🧠 Mirage is trained in two phases:
1) Grounding: it learns to produce latent tokens anchored in real images.
2) Refinement: the model drops the images and learns to generate visual tokens on its own.
And yes, it works! On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines. Smart sketches > empty words.
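A toy PyTorch sketch of those two phases (my own naming, not Mirage's actual code): the latent "sketch" tokens are just extra embeddings, anchored to real visual features in phase 1 and trained only through the task loss in phase 2.

```python
import torch
import torch.nn.functional as F

hidden = 64
text_emb = torch.randn(1, 10, hidden)    # embedded text prompt
real_visual = torch.randn(1, 4, hidden)  # compressed features of a real image

# the "mental sketch": a few latent visual tokens, not pixels
latent_sketch = torch.nn.Parameter(torch.randn(1, 4, hidden))

# phase 1 (grounding): pull the latent tokens toward real visual features
grounding_loss = F.mse_loss(latent_sketch, real_visual)

# phase 2 (refinement): drop the image anchor; the sketch tokens are simply
# interleaved with the text and trained through the downstream task loss only
sequence = torch.cat([text_emb, latent_sketch], dim=1)
print(sequence.shape, grounding_loss.item())
```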
Dataset Viewer for PDFs just landed on Hugging Face 🤗 you can now preview all the PDFs more easily than before!
on top of this, there's the PdfFolder format to load PDF datasets quicker 💨
> to use it, your dataset should follow a directory structure like folder/train/doc1.pdf, folder/train/doc2.pdf
> if you want to include bounding boxes, labels etc. you can keep them in a metadata.csv file in the same folder 🤗
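Loading one of these could look like the sketch below, assuming the builder is exposed as "pdffolder" the same way the existing imagefolder/audiofolder builders are; double-check the exact name in your datasets version.

```python
from datasets import load_dataset

# expects folder/train/doc1.pdf, folder/train/doc2.pdf, ...
ds = load_dataset("pdffolder", data_dir="folder")
print(ds["train"][0])

# for labels/boxes, drop a metadata.csv next to the PDFs with a
# "file_name" column matching each document, e.g.:
# file_name,label
# doc1.pdf,invoice
# doc2.pdf,report
```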
we've merged the LightGlue keypoint matcher into Hugging Face transformers! it allows commercial use when paired with an open-source keypoint detector 👇🏻
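Usage should mirror the existing SuperGlue integration; a minimal sketch following that pattern, where the checkpoint id and post-processing call are my assumptions, so verify them against the transformers docs.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

ckpt = "ETH-CVG/lightglue_superpoint"  # assumed checkpoint id
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

images = [Image.open("view1.jpg"), Image.open("view2.jpg")]  # your image pair
inputs = processor(images, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# recover per-pair matches above a confidence threshold
image_sizes = [[(im.height, im.width) for im in images]]
matches = processor.post_process_keypoint_matching(outputs, image_sizes, threshold=0.2)
print(matches[0]["keypoints0"].shape, matches[0]["matching_scores"].shape)
```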