---
license: apache-2.0
datasets:
- HPLT/HPLT2.0_cleaned
- nvidia/Nemotron-CC-v2
- HuggingFaceTB/finemath
- bigcode/starcoderdata
language:
- en
- bg
- cs
- da
- de
- el
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- es
- sv
- ca
- eu
- gl
- bs
- ka
- mk
- sq
- sr
- tr
- uk
- is
- 'no'
---

# Model Details

This is a decoder-only model with approximately 2.15B parameters. The architecture largely follows the Llama design, with the following key hyperparameters (see the configuration sketch after the list):

- Hidden Size: 2048
- Attention Heads: 32
- Layers: 24
- Sequence Length: 2048
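
As a rough illustration, the hyperparameters above can be written as a Hugging Face `LlamaConfig`. This is a minimal sketch rather than the released configuration: only the four values listed in this card are taken from it, and everything else (vocabulary size, intermediate size, attention variants) is an assumption or a library default.

```python
# Minimal sketch of the architecture as a Hugging Face LlamaConfig.
# Only hidden_size, num_attention_heads, num_hidden_layers and
# max_position_embeddings come from this card; the rest are assumptions or
# transformers defaults, so this will not reproduce the exact 2.15B count.
from transformers import LlamaConfig

config = LlamaConfig(
    hidden_size=2048,              # Hidden Size
    num_attention_heads=32,        # Attention Heads
    num_hidden_layers=24,          # Layers
    max_position_embeddings=2048,  # Sequence Length
    vocab_size=262_144,            # assumed: ~262K Gemma-3 tokenizer vocabulary
)
print(config)
```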

# Training Data

The training data is a diverse, multilingual dataset combining high-quality English, code, math, and European language corpora. The total token budget for training is 4 trillion tokens. The training mixture comprises the following datasets:

- English: A mixture of the Nemotron-CC high-actual and medium-high-actual subsets.
- Code: The StarCoder dataset.
- Math: The FineMath 4+ dataset.
- Multilingual: The cleaned version of HPLT 2.0, covering 36 official EU and partner languages.

The final data split is based on predefined proportions for English, code, and math, with the remaining token budget allocated to the other languages. The minimum proportion for each language in the mix is 0.05% (with the exception of nno_Latn, which is combined with nob_Latn for the proportion calculation).
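
The exact mixture weights are not listed in this card, so the sketch below only illustrates the allocation rule described above with made-up placeholder proportions; the 4-trillion-token budget and the 0.05% per-language floor are the only figures taken from the card.

```python
# Schematic sketch of the allocation rule described above.
# The English/code/math shares and per-language sizes are placeholders, NOT the
# real mixture; only the 4T budget and the 0.05% floor come from this card.
TOTAL_TOKENS = 4_000_000_000_000   # 4 trillion token budget

fixed_share = {"english": 0.50, "code": 0.10, "math": 0.05}   # hypothetical
remaining = 1.0 - sum(fixed_share.values())                   # multilingual share
min_share = 0.0005                                            # 0.05% floor

# Hypothetical relative corpus sizes for a few languages (arbitrary units).
available = {"de": 120.0, "fr": 110.0, "is": 0.8, "mt": 0.3}
total_available = sum(available.values())

shares = {
    # Proportional share of the multilingual budget, clipped to the floor
    # (nno_Latn would be pooled with nob_Latn before this calculation).
    lang: max(remaining * size / total_available, min_share)
    for lang, size in available.items()
}
# In practice the shares would be renormalized after clipping so they still
# sum to the multilingual budget.

for lang, share in shares.items():
    print(f"{lang}: {share:.4%} of the mix ≈ {share * TOTAL_TOKENS:,.0f} tokens")
```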

# Tokenizer

The model uses the Gemma-3 tokenizer, a SentencePiece tokenizer with a 262K vocabulary. It supports over 140 languages, which contributes to the model's multilingual performance.
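
As a usage sketch (the repository ID is a placeholder for this model's actual Hub name), the bundled tokenizer can be loaded with `transformers`:

```python
# Sketch: loading the bundled Gemma-3 SentencePiece tokenizer.
# "ORG/MODEL-NAME" is a placeholder for this model's actual Hub repository ID.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ORG/MODEL-NAME")

print(len(tokenizer))                         # expected to be roughly 262K
ids = tokenizer("Hyvää huomenta! Dobré ráno!")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))   # subword pieces across languages
print(tokenizer.decode(ids))
```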

# Training Information

The model was trained using the Megatron-LM framework on the LUMI supercomputer. The training utilized 64 AMD MI250x nodes, totaling approximately 165,000 GPU hours.
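
For a rough sense of scale, the figures above translate into the following back-of-envelope wall-clock estimate; the number of MI250x modules per LUMI-G node and the per-module accounting of GPU hours are assumptions.

```python
# Back-of-envelope estimate from the figures above.
# Assumptions: 4 MI250x modules per LUMI-G node, and GPU hours counted per
# module; if hours are counted per GCD (2 per module), the estimate halves.
nodes = 64
modules_per_node = 4
gpu_hours = 165_000   # approximate total from this card

wall_clock_hours = gpu_hours / (nodes * modules_per_node)
print(f"~{wall_clock_hours:.0f} hours ≈ {wall_clock_hours / 24:.1f} days of training")
```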

## Intermediate Checkpoints

We have released intermediate checkpoints to provide access to the model's training progression. These checkpoints are available in separate branches, with a new checkpoint released every 5000 training steps.

The naming convention is `checkpoint_` followed by the iteration count zero-padded to seven digits. For example, the checkpoint for 50000 iterations is named `checkpoint_0050000`. The available checkpoints range from `checkpoint_0005000` up to `checkpoint_0953675`. The final checkpoint, `checkpoint_0953675`, is located in the `main` branch.
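
A specific intermediate checkpoint can be loaded by passing its branch name as the `revision` argument. This is only a sketch: the repository ID below is a placeholder for this model's actual Hub name.

```python
# Sketch: loading a specific intermediate checkpoint from its branch.
# "ORG/MODEL-NAME" is a placeholder for this model's actual Hub repository ID.
from transformers import AutoModelForCausalLM, AutoTokenizer

revision = "checkpoint_0050000"   # branch for the 50000-iteration checkpoint

tokenizer = AutoTokenizer.from_pretrained("ORG/MODEL-NAME", revision=revision)
model = AutoModelForCausalLM.from_pretrained("ORG/MODEL-NAME", revision=revision)
```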