SmolVLM: Redefining small and efficient multimodal models Paper • 2504.05299 • Published Apr 7 • 191
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4 • 235
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25, 2024 • 98
StarCoder 2 and The Stack v2: The Next Generation Paper • 2402.19173 • Published Feb 29, 2024 • 147
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents Paper • 2306.16527 • Published Jun 21, 2023 • 46
XTREME-S: Evaluating Cross-lingual Speech Representations Paper • 2203.10752 • Published Mar 21, 2022 • 1