OpenR1-Math Collection Dataset and SFT model distilled from DeepSeek-R1. Check out our blog post for more details: https://huggingface.co/blog/open-r1/update-2 • 3 items • Updated May 13 • 9
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4 • 235
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22, 2024 • 132
Article SmolLM - blazingly fast and remarkably powerful By loubnabnl and 2 others • Jul 16, 2024 • 380
Article Ethics and Society Newsletter #6: Building Better AI: The Importance of Data Quality By evijit and 9 others • Jun 24, 2024 • 34
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25, 2024 • 98
Dataset comparison models Collection 1.8B models trained on 350BT to compare different pretraining datasets • 8 items • Updated Jun 12, 2024 • 39
StarCoder 2 and The Stack v2: The Next Generation Paper • 2402.19173 • Published Feb 29, 2024 • 147
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only Paper • 2306.01116 • Published Jun 1, 2023 • 35