---
license: apache-2.0
datasets:
- HPLT/HPLT2.0_cleaned
- nvidia/Nemotron-CC-v2
- HuggingFaceTB/finemath
- bigcode/starcoderdata
language:
- en
- bg
- cs
- da
- de
- el
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- es
- sv
- ca
- eu
- gl
- bs
- ka
- mk
- sq
- sr
- tr
- uk
- is
- 'no'
---

# Model Details

This is a decoder-only model with approximately 2.15B parameters. The architecture largely follows the Llama design, with the following key hyperparameters (see the configuration sketch after the list):

- Hidden Size: 2048
- Attention Heads: 32
- Layers: 24
- Sequence Length: 2048
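
As a rough illustration, the hyperparameters above can be written as a Hugging Face `LlamaConfig`. This is a minimal sketch rather than the released configuration: only the four values listed in this card are taken from it, and everything else (vocabulary size, intermediate size, attention variants) is an assumption or a library default.

```python
# Minimal sketch of the architecture as a Hugging Face LlamaConfig.
# Only hidden_size, num_attention_heads, num_hidden_layers and
# max_position_embeddings come from this card; the rest are assumptions or
# transformers defaults, so this will not reproduce the exact 2.15B count.
from transformers import LlamaConfig

config = LlamaConfig(
    hidden_size=2048,              # Hidden Size
    num_attention_heads=32,        # Attention Heads
    num_hidden_layers=24,          # Layers
    max_position_embeddings=2048,  # Sequence Length
    vocab_size=262_144,            # assumed: ~262K Gemma-3 tokenizer vocabulary
)
print(config)
```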

# Training Data

The training data is a diverse, multilingual dataset combining high-quality English, code, math, and European language corpora. The total token budget for training is 4 trillion tokens. The training mixture comprises the following datasets:

- English: A mixture of the Nemotron-CC high-actual and medium-high-actual subsets.
- Code: The StarCoder dataset.
- Math: The FineMath 4+ dataset.
- Multilingual: The cleaned version of HPLT 2.0, covering 36 official EU and partner languages.

The final data split is based on predefined proportions for English, code, and math, with the remaining token budget allocated to the other languages. The minimum proportion for each language in the mix is 0.05% (with the exception of nno_Latn, which is combined with nob_Latn for the proportion calculation).
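
The exact mixture weights are not listed in this card, so the sketch below only illustrates the allocation rule described above with made-up placeholder proportions; the 4-trillion-token budget and the 0.05% per-language floor are the only figures taken from the card.

```python
# Schematic sketch of the allocation rule described above.
# The English/code/math shares and per-language sizes are placeholders, NOT the
# real mixture; only the 4T budget and the 0.05% floor come from this card.
TOTAL_TOKENS = 4_000_000_000_000   # 4 trillion token budget

fixed_share = {"english": 0.50, "code": 0.10, "math": 0.05}   # hypothetical
remaining = 1.0 - sum(fixed_share.values())                   # multilingual share
min_share = 0.0005                                            # 0.05% floor

# Hypothetical relative corpus sizes for a few languages (arbitrary units).
available = {"de": 120.0, "fr": 110.0, "is": 0.8, "mt": 0.3}
total_available = sum(available.values())

shares = {
    # Proportional share of the multilingual budget, clipped to the floor
    # (nno_Latn would be pooled with nob_Latn before this calculation).
    lang: max(remaining * size / total_available, min_share)
    for lang, size in available.items()
}
# In practice the shares would be renormalized after clipping so they still
# sum to the multilingual budget.

for lang, share in shares.items():
    print(f"{lang}: {share:.4%} of the mix ≈ {share * TOTAL_TOKENS:,.0f} tokens")
```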

# Tokenizer

The model uses the Gemma-3 tokenizer, a SentencePiece tokenizer with a 262K vocabulary. It supports over 140 languages, which contributes to the model's multilingual performance.
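
As a usage sketch (the repository ID is a placeholder for this model's actual Hub name), the bundled tokenizer can be loaded with `transformers`:

```python
# Sketch: loading the bundled Gemma-3 SentencePiece tokenizer.
# "ORG/MODEL-NAME" is a placeholder for this model's actual Hub repository ID.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ORG/MODEL-NAME")

print(len(tokenizer))                         # expected to be roughly 262K
ids = tokenizer("Hyvää huomenta! Dobré ráno!")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))   # subword pieces across languages
print(tokenizer.decode(ids))
```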

# Training Information

The model was trained using the Megatron-LM framework on the LUMI supercomputer. The training utilized 64 AMD MI250x nodes, totaling approximately 165,000 GPU hours.
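
For a rough sense of scale, the figures above translate into the following back-of-envelope wall-clock estimate; the number of MI250x modules per LUMI-G node and the per-module accounting of GPU hours are assumptions.

```python
# Back-of-envelope estimate from the figures above.
# Assumptions: 4 MI250x modules per LUMI-G node, and GPU hours counted per
# module; if hours are counted per GCD (2 per module), the estimate halves.
nodes = 64
modules_per_node = 4
gpu_hours = 165_000   # approximate total from this card

wall_clock_hours = gpu_hours / (nodes * modules_per_node)
print(f"~{wall_clock_hours:.0f} hours ≈ {wall_clock_hours / 24:.1f} days of training")
```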

## Intermediate Checkpoints

We have released intermediate checkpoints to provide access to the model's training progression. These checkpoints are available in separate branches, with a new checkpoint released every 5000 training steps.

The naming convention is `checkpoint_` followed by the iteration count zero-padded to seven digits. For example, the checkpoint for 50000 iterations is named `checkpoint_0050000`. The available checkpoints range from `checkpoint_0005000` up to `checkpoint_0953675`. The final checkpoint, `checkpoint_0953675`, is located in the `main` branch.
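
A specific intermediate checkpoint can be loaded by passing its branch name as the `revision` argument. This is only a sketch: the repository ID below is a placeholder for this model's actual Hub name.

```python
# Sketch: loading a specific intermediate checkpoint from its branch.
# "ORG/MODEL-NAME" is a placeholder for this model's actual Hub repository ID.
from transformers import AutoModelForCausalLM, AutoTokenizer

revision = "checkpoint_0050000"   # branch for the 50000-iteration checkpoint

tokenizer = AutoTokenizer.from_pretrained("ORG/MODEL-NAME", revision=revision)
model = AutoModelForCausalLM.from_pretrained("ORG/MODEL-NAME", revision=revision)
```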