# OpenSeek-Small-v1-Baseline Model Documentation

## Overview

We sampled 100 billion tokens from the CCI4.0 dataset and trained a 1.4B-parameter MoE model with 0.4B active parameters. This model, along with the dataset, is open-sourced as a baseline for future experiments in areas such as dataset construction, algorithmic strategies, and parallel training frameworks. The model architecture is identical to that of the OpenSeek-Small-v1 model.
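
As a quick sanity check on these sizes, the total parameter count can be read directly off the released checkpoint. A minimal sketch, assuming the checkpoint loads with `transformers` exactly as in the Usage Instructions below:

```python
from transformers import AutoModelForCausalLM

# Load the baseline checkpoint (repo id taken from the Usage Instructions).
model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")

# Sum over all weights. For an MoE model this counts every expert (~1.4B
# here); the ~0.4B *active* parameters per token depend on how many experts
# the router selects, so they cannot be read off the parameter list alone.
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params / 1e9:.2f}B")
```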

## Training Data

The sampling ratio for each domain is as follows (see the sketch after the table for one way to turn these ratios into per-domain token budgets):

| Name | Ratio |
|------|-------|
| Nemotron-CC-high-actual-actual-high | 1.1068 |
| Nemotron-CC-high-actual-actual-low | 0.3577 |
| Nemotron-CC-high-actual-actual-mid | 0.7775 |
| Nemotron-CC-high-synthetic-distill-high | 0.2859 |
| Nemotron-CC-high-synthetic-distill-low | 0.1672 |
| Nemotron-CC-high-synthetic-distill-mid | 0.2339 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-high | 0.5397 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-low | 0.4064 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-mid | 0.5005 |
| Nemotron-CC-high-synthetic-extract_knowledge-high | 0.4616 |
| Nemotron-CC-high-synthetic-extract_knowledge-low | 0.0670 |
| Nemotron-CC-high-synthetic-extract_knowledge-mid | 0.3429 |
| Nemotron-CC-high-synthetic-knowledge_list-high | 0.2610 |
| Nemotron-CC-high-synthetic-knowledge_list-low | 0.1824 |
| Nemotron-CC-high-synthetic-knowledge_list-mid | 0.2313 |
| Nemotron-CC-high-synthetic-wrap_medium-high | 0.8237 |
| Nemotron-CC-high-synthetic-wrap_medium-low | 0.2866 |
| Nemotron-CC-high-synthetic-wrap_medium-mid | 0.6670 |
| Nemotron-CC-low-synthetic-wrap_medium-high | 0.4657 |
| Nemotron-CC-low-synthetic-wrap_medium-low | 0.2005 |
| Nemotron-CC-low-synthetic-wrap_medium-mid | 0.4317 |
| Nemotron-CC-medium-actual-actual-high | 1.1397 |
| Nemotron-CC-medium-actual-actual-low | 0.6782 |
| Nemotron-CC-medium-actual-actual-mid | 0.9175 |
| arxiv | 0.6414 |
| books | 0.4696 |
| code-high | 1.0102 |
| code-low | 1.1403 |
| code-mid | 0.9674 |
| cot_synthesis2_CC-high | 0.3755 |
| cot_synthesis2_CC-low | 0.0499 |
| cot_synthesis2_CC-mid | 1.8299 |
| cot_synthesis2_OpenSource-high | 0.2573 |
| cot_synthesis2_OpenSource-low | 0.1638 |
| cot_synthesis2_OpenSource-mid | 0.3251 |
| cot_synthesis2_arxiv-high | 6.0237 |
| cot_synthesis2_arxiv-low | 8.9063 |
| cot_synthesis2_arxiv-mid | 10.1376 |
| cot_synthesis2_code-high | 0.4598 |
| cot_synthesis2_code-low | 0.6857 |
| cot_synthesis2_code-mid | 0.8990 |
| cot_synthesis2_math-high | 1.3135 |
| cot_synthesis2_math-low | 1.6530 |
| cot_synthesis2_math-mid | 0.3536 |
| cot_synthesis2_wiki-high | 0.6314 |
| cot_synthesis2_wiki-low | 0.5978 |
| cot_synthesis2_wiki-mid | 0.7909 |
| cot_synthesis_CC-high | 0.2225 |
| cot_synthesis_CC-low | 0.1797 |
| cot_synthesis_CC-mid | 0.2042 |
| cot_synthesis_OpenSource-high | 0.4081 |
| cot_synthesis_OpenSource-low | 0.1659 |
| cot_synthesis_OpenSource-mid | 1.2828 |
| cot_synthesis_arxiv-high | 5.68 |
| cot_synthesis_arxiv-low | 7.4907 |
| cot_synthesis_arxiv-mid | 8.9359 |
| cot_synthesis_code-high | 0.7663 |
| cot_synthesis_code-low | 0.4052 |
| cot_synthesis_code-mid | 0.1916 |
| cot_synthesis_math-high | 0.5074 |
| cot_synthesis_math-low | 0.6437 |
| cot_synthesis_math-mid | 0.6406 |
| cot_synthesis_wiki-high | 0.4000 |
| cot_synthesis_wiki-low | 0.3564 |
| cot_synthesis_wiki-mid | 0.5768 |
| math-high | 1.8165 |
| math-low | 1.6940 |
| math-mid | 1.6311 |
| pes2o | 6.1982 |
| pes2o-full-train | 1.4257 |
| pes2o-full-val | 0.0143 |
| stack | 0.4229 |
| wiki | 0.4202 |
| zh_cc-high-loss0 | 1.8171 |
| zh_cc-high-loss1 | 0.9776 |
| zh_cc-high-loss2 | 0.3725 |
| zh_cc-medium-loss0 | 0.9492 |
| zh_cc-medium-loss1 | 0.9236 |
| zh_cc-medium-loss2 | 1.0643 |
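
The ratios above sum to roughly 100, so each can be read as an approximate percentage of the 100B-token sample. A minimal sketch of turning them into per-domain token budgets under that assumption (the dictionary is truncated to a few domains for illustration; names and values are copied from the table):

```python
# Domain ratios copied from the table above (truncated for illustration).
ratios = {
    "Nemotron-CC-high-actual-actual-high": 1.1068,
    "cot_synthesis2_arxiv-mid": 10.1376,
    "pes2o": 6.1982,
    "wiki": 0.4202,
}

TOTAL_TOKENS = 100_000_000_000  # the 100B-token sample from the Overview

# Treat each ratio as a percentage of the full mixture.
for name, pct in ratios.items():
    tokens = pct / 100 * TOTAL_TOKENS  # approximate per-domain token budget
    print(f"{name}: ~{tokens / 1e9:.2f}B tokens")
```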

## Wandb

Our training curves are recorded on [Weights & Biases](https://wandb.ai/openseek-team/OpenSeek-Small-v1-Baseline).

## Evaluation

We used the LightEval library for model evaluation, following the same setup as in FineWeb and CCI3-HQ.

All evaluations were conducted in a zero-shot setting, except GSM8K, which is evaluated 5-shot. To compare performance across datasets directly, we report an overall Average, computed as the mean of the English-benchmark and Chinese-benchmark averages; the short script after the table reproduces these numbers.

| Metric | Score |
|--------|-------|
| HellaSwag | 42.09 |
| ARC (Average) | 40.11 |
| PIQA | 67.14 |
| MMLU (cloze) | 31.29 |
| CommonsenseQA | 28.17 |
| TriviaQA | 6.51 |
| WinoGrande | 51.38 |
| OpenBookQA | 33.00 |
| GSM8K (5-shot) | 6.67 |
| SIQA | 41.86 |
| CEval | 30.19 |
| CMMLU | 30.25 |
| **Average-English** | **34.82** |
| **Average-Chinese** | **30.22** |
| **Overall Average** | **32.52** |
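
A minimal script reproducing the averages from the per-benchmark scores above:

```python
# Per-benchmark scores copied from the table above.
english = {
    "HellaSwag": 42.09, "ARC (Average)": 40.11, "PIQA": 67.14,
    "MMLU (cloze)": 31.29, "CommonsenseQA": 28.17, "TriviaQA": 6.51,
    "WinoGrande": 51.38, "OpenBookQA": 33.00, "GSM8K (5-shot)": 6.67,
    "SIQA": 41.86,
}
chinese = {"CEval": 30.19, "CMMLU": 30.25}

avg_en = sum(english.values()) / len(english)  # 34.82
avg_zh = sum(chinese.values()) / len(chinese)  # 30.22
overall = (avg_en + avg_zh) / 2                # 32.52

print(f"Average-English: {avg_en:.2f}")
print(f"Average-Chinese: {avg_zh:.2f}")
print(f"Overall Average: {overall:.2f}")
```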

## Usage Instructions

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the baseline checkpoint and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")
tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")

# Encode a prompt, generate a continuation, and decode it back to text.
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```
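
By default `generate` decodes greedily; sampling can be enabled with arguments such as `do_sample=True`, `temperature`, and `top_p` for more varied completions. Note also that `max_length` counts the prompt tokens, while `max_new_tokens` bounds only the generated continuation.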