Models
Datasets
Spaces
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2506.05209

MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

Paper • 2405.07526 • Published May 13, 2024 • 22
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Paper • 2405.15613 • Published May 24, 2024 • 18
A Touch, Vision, and Language Dataset for Multimodal Alignment

Paper • 2402.13232 • Published Feb 20, 2024 • 16
How Do Large Language Models Acquire Factual Knowledge During Pretraining?

Paper • 2406.11813 • Published Jun 17, 2024 • 32

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 44

data open access

Open and clean datasets

openbmb/Ultra-FineWeb-classifier

Updated Jun 16 • 18 • 17
HuggingFaceFW/fineweb-2

Viewer • Updated Jun 27 • 5.02B • 466k • 607
PleIAs/common_corpus

Viewer • Updated Jun 10 • 470M • 21.1k • 304
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 44

Common Pile v0.1

All resources related to Common Pile v0.1, an 8TB dataset of public domain and openly licensed text

Common Pile v0.1 Raw Data

Collection

8TB of public domain and openly licensed text • 30 items • Updated Jun 6 • 18
Common Pile v0.1 Filtered Data

Collection

An LLM pre-training dataset produced by filtering and deduplicating the raw text collected in the Common Pile v0.1 • 31 items • Updated Jun 6 • 17
Comma v0.1 Artifacts

Collection

A collection of artifacts related to Comma v0.1—a 7B parameter LLM trained on public domain and openly licensed text • 3 items • Updated Jun 6 • 4
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 44

LLM Pruning and Distillation in Practice: The Minitron Approach

Paper • 2408.11796 • Published Aug 21, 2024 • 59
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Paper • 2408.09174 • Published Aug 17, 2024 • 53
To Code, or Not To Code? Exploring Impact of Code in Pre-training

Paper • 2408.10914 • Published Aug 20, 2024 • 43
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

Paper • 2408.11878 • Published Aug 20, 2024 • 63

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 44

June 2025 - Top Papers

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Paper • 2506.07044 • Published Jun 8 • 110
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Paper • 2506.09513 • Published Jun 11 • 98
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

Paper • 2505.24863 • Published May 30 • 96
Seedance 1.0: Exploring the Boundaries of Video Generation Models

Paper • 2506.09113 • Published Jun 10 • 99

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 44

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 64
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 44

MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

Paper • 2405.07526 • Published May 13, 2024 • 22
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Paper • 2405.15613 • Published May 24, 2024 • 18
A Touch, Vision, and Language Dataset for Multimodal Alignment

Paper • 2402.13232 • Published Feb 20, 2024 • 16
How Do Large Language Models Acquire Factual Knowledge During Pretraining?

Paper • 2406.11813 • Published Jun 17, 2024 • 32

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 44

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 44

June 2025 - Top Papers

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Paper • 2506.07044 • Published Jun 8 • 110
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Paper • 2506.09513 • Published Jun 11 • 98
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

Paper • 2505.24863 • Published May 30 • 96
Seedance 1.0: Exploring the Boundaries of Video Generation Models

Paper • 2506.09113 • Published Jun 10 • 99

data open access

Open and clean datasets

openbmb/Ultra-FineWeb-classifier

Updated Jun 16 • 18 • 17
HuggingFaceFW/fineweb-2

Viewer • Updated Jun 27 • 5.02B • 466k • 607
PleIAs/common_corpus

Viewer • Updated Jun 10 • 470M • 21.1k • 304
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 44

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 44

Common Pile v0.1

All resources related to Common Pile v0.1, an 8TB dataset of public domain and openly licensed text

Common Pile v0.1 Raw Data

Collection

8TB of public domain and openly licensed text • 30 items • Updated Jun 6 • 18
Common Pile v0.1 Filtered Data

Collection

An LLM pre-training dataset produced by filtering and deduplicating the raw text collected in the Common Pile v0.1 • 31 items • Updated Jun 6 • 17
Comma v0.1 Artifacts

Collection

A collection of artifacts related to Comma v0.1—a 7B parameter LLM trained on public domain and openly licensed text • 3 items • Updated Jun 6 • 4
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 44

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 64
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 44

LLM Pruning and Distillation in Practice: The Minitron Approach

Paper • 2408.11796 • Published Aug 21, 2024 • 59
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Paper • 2408.09174 • Published Aug 17, 2024 • 53
To Code, or Not To Code? Exploring Impact of Code in Pre-training

Paper • 2408.10914 • Published Aug 20, 2024 • 43
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

Paper • 2408.11878 • Published Aug 20, 2024 • 63

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs