	# OpenSeek-Small-v1-Baseline Model Documentation

	## Overview
	We sampled 100 billion tokens from the CCI4.0 dataset and trained a 1.4B-parameter MoE model with 0.4B active parameters. The model and the dataset are open-sourced as a baseline for future experiments in areas such as dataset construction, algorithmic strategies, and parallel training frameworks. The model architecture is the same as that of OpenSeek-Small-v1.

	## Training Data
	The ratio for each domain is as follows:

	\| Name \| Ratio \|
	\|-------------------------------------------\|---------\|
	\| Nemotron-CC-high-actual-actual-high \| 1.1068 \|
	\| Nemotron-CC-high-actual-actual-low \| 0.3577 \|
	\| Nemotron-CC-high-actual-actual-mid \| 0.7775 \|
	\| Nemotron-CC-high-synthetic-distill-high \| 0.2859 \|
	\| Nemotron-CC-high-synthetic-distill-low \| 0.1672 \|
	\| Nemotron-CC-high-synthetic-distill-mid \| 0.2339 \|
	\| Nemotron-CC-high-synthetic-diverse_qa_pairs-high \| 0.5397 \|
	\| Nemotron-CC-high-synthetic-diverse_qa_pairs-low \| 0.4064 \|
	\| Nemotron-CC-high-synthetic-diverse_qa_pairs-mid \| 0.5005 \|
	\| Nemotron-CC-high-synthetic-extract_knowledge-high \| 0.4616 \|
	\| Nemotron-CC-high-synthetic-extract_knowledge-low \| 0.0670 \|
	\| Nemotron-CC-high-synthetic-extract_knowledge-mid \| 0.3429 \|
	\| Nemotron-CC-high-synthetic-knowledge_list-high \| 0.2610 \|
	\| Nemotron-CC-high-synthetic-knowledge_list-low \| 0.1824 \|
	\| Nemotron-CC-high-synthetic-knowledge_list-mid \| 0.2313 \|
	\| Nemotron-CC-high-synthetic-wrap_medium-high \| 0.8237 \|
	\| Nemotron-CC-high-synthetic-wrap_medium-low \| 0.2866 \|
	\| Nemotron-CC-high-synthetic-wrap_medium-mid \| 0.6670 \|
	\| Nemotron-CC-low-synthetic-wrap_medium-high \| 0.4657 \|
	\| Nemotron-CC-low-synthetic-wrap_medium-low \| 0.2005 \|
	\| Nemotron-CC-low-synthetic-wrap_medium-mid \| 0.4317 \|
	\| Nemotron-CC-medium-actual-actual-high \| 1.1397 \|
	\| Nemotron-CC-medium-actual-actual-low \| 0.6782 \|
	\| Nemotron-CC-medium-actual-actual-mid \| 0.9175 \|
	\| arxiv \| 0.6414 \|
	\| books \| 0.4696 \|
	\| code-high \| 1.0102 \|
	\| code-low \| 1.1403 \|
	\| code-mid \| 0.9674 \|
	\| cot_synthesis2_CC-high \| 0.3755 \|
	\| cot_synthesis2_CC-low \| 0.0499 \|
	\| cot_synthesis2_CC-mid \| 1.8299 \|
	\| cot_synthesis2_OpenSource-high \| 0.2573 \|
	\| cot_synthesis2_OpenSource-low \| 0.1638 \|
	\| cot_synthesis2_OpenSource-mid \| 0.3251 \|
	\| cot_synthesis2_arxiv-high \| 6.0237 \|
	\| cot_synthesis2_arxiv-low \| 8.9063 \|
	\| cot_synthesis2_arxiv-mid \| 10.1376 \|
	\| cot_synthesis2_code-high \| 0.4598 \|
	\| cot_synthesis2_code-low \| 0.6857 \|
	\| cot_synthesis2_code-mid \| 0.8990 \|
	\| cot_synthesis2_math-high \| 1.3135 \|
	\| cot_synthesis2_math-low \| 1.6530 \|
	\| cot_synthesis2_math-mid \| 0.3536 \|
	\| cot_synthesis2_wiki-high \| 0.6314 \|
	\| cot_synthesis2_wiki-low \| 0.5978 \|
	\| cot_synthesis2_wiki-mid \| 0.7909 \|
	\| cot_synthesis_CC-high \| 0.2225 \|
	\| cot_synthesis_CC-low \| 0.1797 \|
	\| cot_synthesis_CC-mid \| 0.2042 \|
	\| cot_synthesis_OpenSource-high \| 0.4081 \|
	\| cot_synthesis_OpenSource-low \| 0.1659 \|
	\| cot_synthesis_OpenSource-mid \| 1.2828 \|
	\| cot_synthesis_arxiv-high \| 5.68 \|
	\| cot_synthesis_arxiv-low \| 7.4907 \|
	\| cot_synthesis_arxiv-mid \| 8.9359 \|
	\| cot_synthesis_code-high \| 0.7663 \|
	\| cot_synthesis_code-low \| 0.4052 \|
	\| cot_synthesis_code-mid \| 0.1916 \|
	\| cot_synthesis_math-high \| 0.5074 \|
	\| cot_synthesis_math-low \| 0.6437 \|
	\| cot_synthesis_math-mid \| 0.6406 \|
	\| cot_synthesis_wiki-high \| 0.4000 \|
	\| cot_synthesis_wiki-low \| 0.3564 \|
	\| cot_synthesis_wiki-mid \| 0.5768 \|
	\| math-high \| 1.8165 \|
	\| math-low \| 1.6940 \|
	\| math-mid \| 1.6311 \|
	\| pes2o \| 6.1982 \|
	\| pes2o-full-train \| 1.4257 \|
	\| pes2o-full-val \| 0.0143 \|
	\| stack \| 0.4229 \|
	\| wiki \| 0.4202 \|
	\| zh_cc-high-loss0 \| 1.8171 \|
	\| zh_cc-high-loss1 \| 0.9776 \|
	\| zh_cc-high-loss2 \| 0.3725 \|
	\| zh_cc-medium-loss0 \| 0.9492 \|
	\| zh_cc-medium-loss1 \| 0.9236 \|
	\| zh_cc-medium-loss2 \| 1.0643 \|
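
As a rough illustration of how the ratios above translate into token counts, the sketch below assumes that each ratio is a percentage of the 100-billion-token sample (the listed values sum to roughly 100). Only a few rows are copied from the table; `TOTAL_TOKENS` and `token_budget` are hypothetical names introduced for this example and are not part of the released code.

```python
# A few rows copied from the ratio table; the remaining rows are omitted for brevity.
ratios = {
    "arxiv": 0.6414,
    "books": 0.4696,
    "cot_synthesis2_arxiv-mid": 10.1376,
    "pes2o": 6.1982,
}

# Size of the sampled training set stated in the overview (100 billion tokens).
TOTAL_TOKENS = 100_000_000_000


def token_budget(ratio_percent, total_tokens=TOTAL_TOKENS):
    """Convert a ratio (assumed to be a percentage) into an approximate token count."""
    return round(ratio_percent / 100 * total_tokens)


budgets = {name: token_budget(ratio) for name, ratio in ratios.items()}

for name, tokens in budgets.items():
    print(f"{name}: ~{tokens / 1e9:.2f}B tokens")
```

Under this assumption, the largest single domain in the table, `cot_synthesis2_arxiv-mid`, would account for roughly 10 billion of the 100 billion tokens.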

	## Wandb
	Our training curves are logged on Weights & Biases: [wandb](https://wandb.ai/openseek-team/OpenSeek-Small-v1-Baseline).

	## Evaluation
	We used the LightEval library for model evaluation, following the same setup as in FineWeb and CCI3-HQ.
	All evaluations were conducted in a zero-shot setting unless noted otherwise in the table (GSM8K uses 5-shot). To compare performance across datasets directly, we report an Overall Average that combines the scores on the Chinese and English benchmarks.

	\| Metrics \| Score \|
	\|-------------------\|---------\|
	\| HellaSwag \| 42.09 \|
	\| ARC (Average) \| 40.11 \|
	\| PIQA \| 67.14 \|
	\| MMLU (cloze) \| 31.29 \|
	\| CommonsenseQA \| 28.17 \|
	\| TriviaQA \| 6.51 \|
	\| WinoGrande \| 51.38 \|
	\| OpenBookQA \| 33.00 \|
	\| GSM8K (5-shot) \| 6.67 \|
	\| SIQA \| 41.86 \|
	\| CEval \| 30.19 \|
	\| CMMLU \| 30.25 \|
	\| Average-English \| 34.82 \|
	\| Average-Chinese \| 30.22 \|
	\| Overall Average \| 32.52 \|
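
The three average rows can be reproduced from the individual scores. The sketch below assumes that CEval and CMMLU form the Chinese group and that all other benchmarks form the English group; the table does not state this grouping explicitly. Under that assumption, the Overall Average matches the mean of the two language averages rather than the mean over all twelve benchmarks.

```python
# Scores copied from the evaluation table.
# The English/Chinese grouping is an assumption made for this example.
english = {
    "HellaSwag": 42.09,
    "ARC (Average)": 40.11,
    "PIQA": 67.14,
    "MMLU (cloze)": 31.29,
    "CommonsenseQA": 28.17,
    "TriviaQA": 6.51,
    "WinoGrande": 51.38,
    "OpenBookQA": 33.00,
    "GSM8K (5-shot)": 6.67,
    "SIQA": 41.86,
}
chinese = {
    "CEval": 30.19,
    "CMMLU": 30.25,
}


def mean(scores):
    """Arithmetic mean of the values in a benchmark-to-score mapping."""
    return sum(scores.values()) / len(scores)


avg_en = mean(english)
avg_zh = mean(chinese)
# Mean of the two language-level averages, not of all twelve benchmarks.
overall = (avg_en + avg_zh) / 2

print(f"Average-English: {avg_en:.2f}")
print(f"Average-Chinese: {avg_zh:.2f}")
print(f"Overall Average: {overall:.2f}")
```

Rounded to two decimals, these values agree with the 34.82, 30.22, and 32.52 reported in the table.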

	## Usage Instructions
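	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Load the model weights and the matching tokenizer from the Hugging Face Hub.
	model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")
	tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")

	# Tokenize a prompt and generate a continuation (max_length counts the prompt tokens too).
	inputs = tokenizer("The future of AI is", return_tensors="pt")
	outputs = model.generate(**inputs, max_length=50)
	# Decode the generated token IDs back into text.
	print(tokenizer.decode(outputs[0]))
	```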
