PyTorch · llama

vitiugin committed (verified) · Commit c1ed007 · Parent: 1ec30b3

Create README.md

Files changed (1): README.md (+71, −0)

README.md ADDED
---
license: apache-2.0
datasets:
- HPLT/HPLT2.0_cleaned
- nvidia/Nemotron-CC-v2
- HuggingFaceTB/finemath
language:
- en
- bg
- cs
- da
- de
- el
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- es
- sv
- ca
- eu
- gl
- bs
- ka
- mk
- sq
- sr
- tr
- uk
- is
- 'no'
---
# Model Details
This is a decoder-only model with approximately 2.15B parameters. The architecture largely follows the Llama design, with the following key hyperparameters (a configuration sketch follows the list):
- Hidden Size: 2048
- Attention Heads: 32
- Layers: 24
- Sequence Length: 2048

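The following is a minimal, hedged configuration sketch using the Hugging Face `transformers` Llama classes. Only the four hyperparameters listed above (plus the ~262K tokenizer vocabulary mentioned in the Tokenizer section) come from this card; `intermediate_size`, `num_key_value_heads`, and tied embeddings are assumptions chosen so that the parameter count lands near 2.15B, not the released configuration.

```python
# Sketch only: values marked "assumption" are not stated in this model card.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=2048,               # stated: Hidden Size
    num_attention_heads=32,         # stated: Attention Heads
    num_hidden_layers=24,           # stated: Layers
    max_position_embeddings=2048,   # stated: Sequence Length
    vocab_size=262_144,             # assumption: ~262K Gemma-3 vocabulary
    intermediate_size=8192,         # assumption: 4x hidden size
    num_key_value_heads=32,         # assumption: standard multi-head attention
    tie_word_embeddings=True,       # assumption
)

model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```

With these assumptions the count comes out at roughly 2.15B, matching the figure above; the actual released configuration may differ.
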
# Training Data
The training data is a diverse, multilingual mixture that combines high-quality English, code, math, and European-language corpora. The total training budget is 4 trillion tokens. The training mixture comprises the following datasets:
- English: A mixture of the Nemotron-CC high-actual and medium-high-actual subsets.
- Code: The StarCoder dataset.
- Math: The FineMath 4+ dataset.
- Multilingual: The cleaned version of HPLT 2.0, covering 36 official EU and partner languages.

The final data split is based on predefined proportions for English, code, and math, with the remaining token budget allocated to the other languages. The minimum proportion for any language in the mix is 0.05% (with the exception of nno_Latn, which is combined with nob_Latn for the proportion calculation).

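To make the allocation rule concrete, here is an illustrative sketch, not the released data pipeline: English, code, and math receive predefined shares, the remainder is split across the other languages in proportion to their available tokens, and every language is floored at 0.05%. The fixed shares and per-language token counts below are hypothetical placeholders; only the 4-trillion-token budget and the 0.05% floor come from this card.

```python
# Illustrative only: fixed shares and per-language token counts are hypothetical.
TOTAL_TOKENS = 4_000_000_000_000          # stated 4T training budget
FLOOR = 0.0005                            # stated 0.05% minimum per language

fixed_share = {"english": 0.50, "code": 0.10, "math": 0.05}              # hypothetical
lang_tokens = {"deu_Latn": 300e9, "fra_Latn": 250e9, "mlt_Latn": 0.5e9}  # hypothetical

remaining = 1.0 - sum(fixed_share.values())
total_available = sum(lang_tokens.values())

# Proportional split of the remaining share, with the 0.05% floor applied.
# (A final renormalisation step, omitted here, would keep the shares summing to 1.)
lang_share = {lang: max(remaining * toks / total_available, FLOOR)
              for lang, toks in lang_tokens.items()}

for name, share in {**fixed_share, **lang_share}.items():
    print(f"{name}: {share:6.2%} -> {share * TOTAL_TOKENS / 1e9:8.1f}B tokens")
```
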
![detailed_data_5050](https://cdn-uploads.huggingface.co/production/uploads/618bf745f723a0c1e7f2ce6d/MqClllN89k8QeP2cSBw8Z.png)

# Tokenizer
The model uses the Gemma-3 tokenizer, a SentencePiece tokenizer with a 262K vocabulary. It supports over 140 languages, which contributes to the model's multilingual performance.

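A minimal usage sketch, assuming the tokenizer files are shipped with this repository; `MODEL_REPO` below is a placeholder, since the Hub id is not given in this excerpt.

```python
# Placeholder repo id; replace with the actual Hub id of this model.
from transformers import AutoTokenizer

MODEL_REPO = "org/model-name"
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)

print(tokenizer.vocab_size)                               # expected to be ~262K
print(tokenizer.tokenize("Ceci est une phrase en français."))
```
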
# Training Information
The model was trained using the Megatron-LM framework on the LUMI supercomputer. The training used 64 AMD MI250X nodes, for a total of approximately 165,000 GPU hours.

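As a rough back-of-the-envelope check, the sketch below converts the stated GPU-hour total into wall-clock time, assuming LUMI-G's usual 4× MI250X (8 GCDs) per node and GPU hours counted per GCD; both assumptions go beyond what this card states.

```python
# Rough estimate only; node composition and GPU-hour accounting are assumptions.
nodes, gcds_per_node, gpu_hours = 64, 8, 165_000
wall_clock_hours = gpu_hours / (nodes * gcds_per_node)
print(f"~{wall_clock_hours:.0f} hours (~{wall_clock_hours / 24:.1f} days)")  # ~322 h, ~13 days
```
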
# Intermediate Checkpoints
We have released intermediate checkpoints to give access to the model's training progression. These checkpoints are available in separate branches, with a new checkpoint released every 5,000 training steps.

The naming convention is `checkpoint_` followed by the seven-digit, zero-padded training step: for example, the checkpoint for 50,000 iterations is named `checkpoint_0050000`. The available checkpoints range from `checkpoint_0005000` up to `checkpoint_0953675`. The final checkpoint, `checkpoint_0953675`, is located in the main branch.

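Since each intermediate checkpoint lives in its own branch, a specific training step can presumably be loaded by passing the branch name as the `revision`; `MODEL_REPO` is again a placeholder for the actual Hub id.

```python
# Sketch: load the 50,000-step intermediate checkpoint from its branch.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_REPO = "org/model-name"   # placeholder for the actual Hub id
REVISION = "checkpoint_0050000"

model = AutoModelForCausalLM.from_pretrained(MODEL_REPO, revision=REVISION)
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO, revision=REVISION)
```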