---
base_model:
- ibm-granite/granite-4.0-1b-base
license: apache-2.0
library_name: transformers
tags:
- language
- unsloth
- granite-4.0
---
Unsloth Dynamic 2.0 quantization achieves superior accuracy and outperforms other leading quants.
| Benchmarks | Metric | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
|---|---|---|---|---|---|
| **General Tasks** | | | | | |
| MMLU | 5-shot | 33.08 | 36.07 | 59.82 | 58.71 |
| MMLU-Pro | 5-shot, CoT | 11.29 | 10.08 | 29.96 | 23.45 |
| BBH | 3-shot, CoT | 32.19 | 29.96 | 57.73 | 48.45 |
| AGIEval | 3-shot | 28.97 | 29.20 | 48.95 | 47.46 |
| DROP | 5-shot | 29.77 | 28.56 | 58.18 | 57.18 |
| **Math Tasks** | | | | | |
| GSM8K | 8-shot | 24.11 | 24.41 | 62.40 | 57.39 |
| Minerva Math | 4-shot | 9.96 | 11.50 | 30.30 | 21.30 |
| **Code Tasks** | | | | | |
| HumanEval | pass@1 [StarCoder Prompt] | 34.60 | 35.61 | 68.08 | 68.26 |
| HumanEval | pass@1 | 32 | 34 | 60 | 59 |
| HumanEval+ | pass@1 | 29 | 29 | 57 | 56 |
| MBPP | pass@1 | 45 | 17 | 72 | 65 |
| MBPP+ | pass@1 | 38 | 16 | 60 | 54 |
| **Multilingual Tasks** | | | | | |
| MMMLU | 5-shot | 30.93 | 31.02 | 46.73 | 48.55 |
| INCLUDE | 5-shot | 27.32 | 29.26 | 42.60 | 43.80 |
| MGSM | 8-shot | 13.92 | 15.12 | 46.96 | 41.52 |
| Benchmarks | # Langs | Languages |
|---|---|---|
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |
| MGSM | 5 | en, es, fr, ja, zh |
| Model | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
|---|---|---|---|---|
| Embedding size | 1024 | 768 | 2048 | 1536 |
| Number of layers | 28 attention | 4 attention / 28 Mamba2 | 40 attention | 4 attention / 36 Mamba2 |
| Attention head size | 64 | 64 | 128 | 128 |
| Number of attention heads | 16 | 12 | 16 | 12 |
| Number of KV heads | 4 | 4 | 4 | 4 |
| Mamba2 state size | - | 128 | - | 128 |
| Number of Mamba2 heads | - | 48 | - | 48 |
| MLP / Shared expert hidden size | 2048 | 2048 | 4096 | 4096 |
| Num. Experts | - | - | - | - |
| Num. active Experts | - | - | - | - |
| Expert hidden size | - | - | - | - |
| MLP activation | SwiGLU | SwiGLU | SwiGLU | SwiGLU |
| Sequence length | 32K | 32K | 128K | 128K |
| Position embedding | RoPE | NoPE | RoPE | NoPE |
| # Parameters | 350M | 340M | 1.6B | 1.5B |
| # Active parameters | 350M | 340M | 1.6B | 1.5B |
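As a sanity check, the reported parameter count of the 1B Dense variant can be roughly reconstructed from the dimensions in the table above. The vocabulary size (~100k) and tied input/output embeddings are assumptions not stated in the table, so treat the result as an estimate rather than an exact figure:

```python
# Rough parameter-count estimate for the 1B Dense variant, using the
# dimensions from the architecture table. Vocabulary size and weight
# tying are assumptions (not listed in the table).

d_model = 2048        # Embedding size
n_layers = 40         # attention layers
n_heads = 16          # attention heads
n_kv_heads = 4        # KV heads (grouped-query attention)
head_dim = 128        # Attention head size
mlp_hidden = 4096     # MLP hidden size
vocab = 100_000       # assumed; not given in the table

q_dim = n_heads * head_dim        # query projection width: 2048
kv_dim = n_kv_heads * head_dim    # key/value projection width: 512

# Attention: Q, K, V, and output projections
attn = d_model * q_dim + 2 * d_model * kv_dim + q_dim * d_model
# SwiGLU MLP: gate, up, and down projections
mlp = 3 * d_model * mlp_hidden

per_layer = attn + mlp
total = n_layers * per_layer + vocab * d_model  # tied embeddings assumed
print(f"{total / 1e9:.2f}B")  # ~1.63B, close to the reported 1.6B
```

The estimate lands near the reported 1.6B; the residual gap comes from layer norms and the assumed vocabulary size.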
| Stage | Characteristics | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
|---|---|---|---|---|---|
| I | General mixture of training data, warmup, and power scheduler for learning rate. | 10 | 10 | 10 | 10 |
| II | General mixture of training data with higher percentages of code and math with power scheduler for learning rate. | 2 | 2 | 2 | 2 |
| III | High quality training data, exponential decay of learning rate. | 2 | 2 | 2 | 2 |
| IV | High quality training data, linear decay to zero for learning rate. | 0.5 | 0.5 | 0.5 | 0.5 |