# OpenSeek-Small-v1-Baseline Model Documentation

## Overview
We sampled 100 billion tokens from the CCI4.0 dataset and trained a 1.4B-parameter MoE model with 0.4B active parameters. The model and the dataset are open-sourced as a baseline for future experiments in areas such as dataset construction, algorithmic strategies, and parallel training frameworks. The model architecture is the same as that of the OpenSeek-Small-v1 model.
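
To make the total-versus-active distinction concrete, below is a minimal sketch of top-k expert routing, the mechanism that lets an MoE model leave most of its parameters idle on any given token. The dimensions, expert count, and top-k value here are illustrative assumptions, not the actual OpenSeek-Small-v1 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative top-k MoE feed-forward layer (dimensions are made up,
    NOT the real OpenSeek-Small-v1 config)."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the top-k experts run for each token, so the "active"
        # parameter count is roughly top_k / n_experts of the expert weights.
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out
```

With 8 experts and top-2 routing, only a quarter of the expert parameters run per token; combined with shared always-active layers, this is the kind of accounting behind "1.4B total, 0.4B active."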

## Training Data
The sampling ratio for each domain is as follows (a sketch of sampling from such a weighted mixture follows the table):

| Name                                      | Ratio   |
|-------------------------------------------|---------|
| Nemotron-CC-high-actual-actual-high       | 1.1068  |
| Nemotron-CC-high-actual-actual-low        | 0.3577  |
| Nemotron-CC-high-actual-actual-mid        | 0.7775  |
| Nemotron-CC-high-synthetic-distill-high   | 0.2859  |
| Nemotron-CC-high-synthetic-distill-low    | 0.1672  |
| Nemotron-CC-high-synthetic-distill-mid    | 0.2339  |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-high | 0.5397  |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-low  | 0.4064  |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-mid  | 0.5005  |
| Nemotron-CC-high-synthetic-extract_knowledge-high | 0.4616  |
| Nemotron-CC-high-synthetic-extract_knowledge-low  | 0.0670  |
| Nemotron-CC-high-synthetic-extract_knowledge-mid  | 0.3429  |
| Nemotron-CC-high-synthetic-knowledge_list-high | 0.2610  |
| Nemotron-CC-high-synthetic-knowledge_list-low  | 0.1824  |
| Nemotron-CC-high-synthetic-knowledge_list-mid  | 0.2313  |
| Nemotron-CC-high-synthetic-wrap_medium-high | 0.8237  |
| Nemotron-CC-high-synthetic-wrap_medium-low  | 0.2866  |
| Nemotron-CC-high-synthetic-wrap_medium-mid  | 0.6670  |
| Nemotron-CC-low-synthetic-wrap_medium-high | 0.4657  |
| Nemotron-CC-low-synthetic-wrap_medium-low  | 0.2005  |
| Nemotron-CC-low-synthetic-wrap_medium-mid  | 0.4317  |
| Nemotron-CC-medium-actual-actual-high      | 1.1397  |
| Nemotron-CC-medium-actual-actual-low       | 0.6782  |
| Nemotron-CC-medium-actual-actual-mid       | 0.9175  |
| arxiv                                    | 0.6414  |
| books                                    | 0.4696  |
| code-high                                | 1.0102  |
| code-low                                 | 1.1403  |
| code-mid                                 | 0.9674  |
| cot_synthesis2_CC-high                   | 0.3755  |
| cot_synthesis2_CC-low                    | 0.0499  |
| cot_synthesis2_CC-mid                    | 1.8299  |
| cot_synthesis2_OpenSource-high           | 0.2573  |
| cot_synthesis2_OpenSource-low            | 0.1638  |
| cot_synthesis2_OpenSource-mid            | 0.3251  |
| cot_synthesis2_arxiv-high                | 6.0237  |
| cot_synthesis2_arxiv-low                 | 8.9063  |
| cot_synthesis2_arxiv-mid                 | 10.1376 |
| cot_synthesis2_code-high                 | 0.4598  |
| cot_synthesis2_code-low                  | 0.6857  |
| cot_synthesis2_code-mid                  | 0.8990  |
| cot_synthesis2_math-high                 | 1.3135  |
| cot_synthesis2_math-low                  | 1.6530  |
| cot_synthesis2_math-mid                  | 0.3536  |
| cot_synthesis2_wiki-high                 | 0.6314  |
| cot_synthesis2_wiki-low                  | 0.5978  |
| cot_synthesis2_wiki-mid                  | 0.7909  |
| cot_synthesis_CC-high                    | 0.2225  |
| cot_synthesis_CC-low                     | 0.1797  |
| cot_synthesis_CC-mid                     | 0.2042  |
| cot_synthesis_OpenSource-high            | 0.4081  |
| cot_synthesis_OpenSource-low             | 0.1659  |
| cot_synthesis_OpenSource-mid             | 1.2828  |
| cot_synthesis_arxiv-high                 | 5.68    |
| cot_synthesis_arxiv-low                  | 7.4907  |
| cot_synthesis_arxiv-mid                  | 8.9359  |
| cot_synthesis_code-high                  | 0.7663  |
| cot_synthesis_code-low                   | 0.4052  |
| cot_synthesis_code-mid                   | 0.1916  |
| cot_synthesis_math-high                  | 0.5074  |
| cot_synthesis_math-low                   | 0.6437  |
| cot_synthesis_math-mid                   | 0.6406  |
| cot_synthesis_wiki-high                  | 0.4000  |
| cot_synthesis_wiki-low                   | 0.3564  |
| cot_synthesis_wiki-mid                   | 0.5768  |
| math-high                                | 1.8165  |
| math-low                                 | 1.6940  |
| math-mid                                 | 1.6311  |
| pes2o                                    | 6.1982  |
| pes2o-full-train                         | 1.4257  |
| pes2o-full-val                           | 0.0143  |
| stack                                    | 0.4229  |
| wiki                                     | 0.4202  |
| zh_cc-high-loss0                         | 1.8171  |
| zh_cc-high-loss1                         | 0.9776  |
| zh_cc-high-loss2                         | 0.3725  |
| zh_cc-medium-loss0                       | 0.9492  |
| zh_cc-medium-loss1                       | 0.9236  |
| zh_cc-medium-loss2                       | 1.0643  |
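
As a rough illustration of how per-domain ratios like these translate into a training mixture, the snippet below normalizes the weights and draws domains proportionally. The domain names and values are taken from the table; the sampling scheme itself is a generic sketch, not the actual CCI4.0 pipeline.

```python
import random

# A few domain weights copied from the table above (truncated for brevity).
ratios = {
    "Nemotron-CC-high-actual-actual-high": 1.1068,
    "code-high": 1.0102,
    "math-high": 1.8165,
    "pes2o": 6.1982,
    "wiki": 0.4202,
}

def sample_domains(ratios, n_draws, seed=0):
    """Draw domains proportionally to their normalized ratios."""
    rng = random.Random(seed)
    names = list(ratios)
    total = sum(ratios.values())
    weights = [ratios[name] / total for name in names]
    return rng.choices(names, weights=weights, k=n_draws)

# Each draw picks the domain the next training document comes from.
print(sample_domains(ratios, n_draws=5))
```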

## Wandb
Our training curves are recorded in [Weights & Biases](https://wandb.ai/openseek-team/OpenSeek-Small-v1-Baseline).

## Evaluation
We used the LightEval library for model evaluation, following the same setup as in FineWeb and CCI3-HQ.
All evaluations were conducted in a zero-shot setting, except for GSM8K, which is 5-shot. To compare performance across datasets directly, we report an overall Average: the mean of the English average and the Chinese average (a short script reproducing these summary rows follows the table).

| Metrics           | Score   |
|-------------------|---------|
| HellaSwag         | 42.09   |
| ARC (Average)     | 40.11   |
| PIQA              | 67.14   |
| MMLU (cloze)      | 31.29   |
| CommonsenseQA     | 28.17   |
| TriviaQA          | 6.51    |
| WinoGrande        | 51.38   |
| OpenBookQA        | 33.00   |
| GSM8K (5-shot)    | 6.67    |
| SIQA              | 41.86   |
| CEval             | 30.19   |
| CMMLU             | 30.25   |
| **Average-English** | **34.82** |
| **Average-Chinese** | **30.22** |
| **Overall Average** | **32.52** |
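
As a sanity check on how the summary rows are derived, the snippet below recomputes them. The metric names and scores are copied from the table; the English/Chinese grouping follows the table's own summary rows.

```python
# Scores copied from the evaluation table above.
english = {
    "HellaSwag": 42.09, "ARC (Average)": 40.11, "PIQA": 67.14,
    "MMLU (cloze)": 31.29, "CommonsenseQA": 28.17, "TriviaQA": 6.51,
    "WinoGrande": 51.38, "OpenBookQA": 33.00, "GSM8K (5-shot)": 6.67,
    "SIQA": 41.86,
}
chinese = {"CEval": 30.19, "CMMLU": 30.25}

avg_en = sum(english.values()) / len(english)   # 34.82
avg_zh = sum(chinese.values()) / len(chinese)   # 30.22
overall = (avg_en + avg_zh) / 2                 # 32.52: the mean of the two
                                                # language averages, not of
                                                # all 12 benchmarks
print(f"{avg_en:.2f} {avg_zh:.2f} {overall:.2f}")
```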

## Usage Instructions
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")
tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")

# Tokenize a prompt and greedily generate a continuation of up to
# 50 tokens total (prompt tokens included in the limit).
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```
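
For more varied output, generation can be switched from greedy decoding to sampling. The parameter values below are illustrative defaults, not settings recommended by the OpenSeek team.

```python
# Sampled generation: do_sample enables stochastic decoding; temperature
# and top_p are illustrative values, tune them for your use case.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,   # limit on newly generated tokens (prompt excluded)
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```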