---
title: README
emoji: π
colorFrom: purple
colorTo: yellow
sdk: static
pinned: false
---
# The Common Pile

We are a group of researchers working together to collect and curate openly licensed and public domain data for training large language models. So far, we have released:
- The Common Pile v0.1, an 8 TB dataset of text from over 30 diverse sources
- Our paper: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
- Comma v0.1-1T and Comma v0.1-2T, 7B parameter LLMs trained on text from the Common Pile v0.1
- The training dataset used to train the Comma v0.1 models
- Our code for collecting data from each source
If you're interested in contributing, please open an issue on our GitHub repository!