Fine-tune Gemma3n on videos with audio on a Colab A100! Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!
keep in mind, it's made for educational purposes. we do LoRA, audio resampling & video downsampling to be able to train in <40GB VRAM. stretch modalities and unfreeze layers as you wish! merve/smol-vision
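To make the "audio resampling & video downsampling" part concrete, here is a rough stdlib-only sketch of the two ideas. It is not the notebook's code: the function names, the naive linear interpolation (no anti-aliasing filter), and the pick-the-middle-of-each-segment frame strategy are all assumptions for illustration. A real pipeline would more likely lean on a proper audio/video library.

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` evenly spaced frame indices out of `total_frames`.

    Hypothetical helper: fewer frames per clip means fewer image tokens,
    which is one way to keep memory use down.
    """
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    # Take the middle of each equal-width segment so samples cover the clip.
    return [int(step * i + step / 2) for i in range(num_frames)]


def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Naive linear-interpolation resampler (assumption: no low-pass filtering)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    ratio = src_rate / dst_rate
    out = []
    for i in range(out_len):
        pos = i * ratio                      # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)   # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


if __name__ == "__main__":
    # Keep 4 frames out of a 100-frame clip.
    print(sample_frame_indices(100, 4))
    # Halve the sample rate of a tiny ramp signal.
    print(resample_linear([0.0, 1.0, 2.0, 3.0], src_rate=4, dst_rate=2))
```

The sample rates and frame counts above are toy values; the target rate and frame budget to use depend on what the model's processor expects.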
They have an image tokenizer unified with text, and they de-tokenize using either of two models (LLM and diffusion). The model is actually a full LLM (Qwen2); the tokenizer converts image tokens
Dataset Viewer for PDFs just landed on Hugging Face! you can now preview all the PDFs more easily than before!
on top of this, there's the PdfFolder format to load PDF datasets quicker
> to use it, your dataset should follow a directory format like folder/train/doc1.pdf, folder/train/doc2.pdf
> if you want to include bounding boxes, labels etc. you can keep them in a metadata.csv file in the same folder
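Here's a small sketch of what that directory layout could look like on disk, built with the standard library only. The `build_pdf_folder` helper, the `label` column, and the placeholder PDF bytes are made up for illustration, and the `file_name` column is an assumption that PdfFolder follows the same metadata convention as ImageFolder.

```python
import csv
import tempfile
from pathlib import Path


def build_pdf_folder(root: Path, rows: list[dict]) -> Path:
    """Create root/train/*.pdf plus a metadata.csv next to the PDFs.

    Hypothetical helper. Each row needs a `file_name` key (assumed
    convention); any other keys become extra metadata columns.
    """
    split_dir = root / "train"
    split_dir.mkdir(parents=True, exist_ok=True)
    for row in rows:
        # Placeholder bytes only; a real dataset would copy actual PDFs here.
        (split_dir / row["file_name"]).write_bytes(b"%PDF-1.4\n%%EOF\n")
    metadata_path = split_dir / "metadata.csv"
    with metadata_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return metadata_path


if __name__ == "__main__":
    rows = [
        {"file_name": "doc1.pdf", "label": "invoice"},
        {"file_name": "doc2.pdf", "label": "report"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        metadata = build_pdf_folder(Path(tmp) / "folder", rows)
        print(metadata.read_text())
```

With a folder like this, loading would presumably be something along the lines of `load_dataset("pdffolder", data_dir="folder")` from the `datasets` library; treat that call as an assumption and check the docs for the exact builder name and metadata columns.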
we've merged the LightGlue keypoint matcher into Hugging Face transformers! it allows commercial use when paired with an open-source keypoint detector