# nanochat-d34-rl-all-ckpts
This is an RL-trained version of pankajmathur/nanochat-d34-finetuned, fine-tuned with GRPO (Group Relative Policy Optimization) on GSM8K math problems. Checkpoints from every training step are included.
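For context, the core idea of GRPO is a group-relative advantage: several completions are sampled per prompt (here, `num_samples: 16`) and each completion's reward is normalized against the statistics of its own group, so no learned value function is needed. Below is a minimal sketch of that computation; it is illustrative only, and nanochat's actual implementation may differ in details such as the normalization.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages (illustrative sketch, not nanochat's exact code).

    rewards: (num_prompts, num_samples) scalar rewards, e.g. 1.0 if the
    generated GSM8K answer is correct and 0.0 otherwise.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# 2 prompts x 4 samples: correct answers get positive advantage,
# incorrect ones negative, so the policy gradient pushes toward the former.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))
```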
## Model Description
- Base Model: karpathy/nanochat-d34 (2.2B parameters)
- SFT Model: pankajmathur/nanochat-d34-finetuned
- Architecture: GPT-style transformer with depth=34
- Training Pipeline: Pre-training → Mid-training → SFT → RL (GRPO)
- Hardware: 8x NVIDIA A100-SXM4-80GB GPUs
## Key Achievement: GSM8K +73.6% Improvement
RL training substantially improved math reasoning, lifting GSM8K from 0.1327 to 0.2305 (0.2305 / 0.1327 ≈ 1.736, i.e. +73.6% relative), while leaving most general benchmarks nearly unchanged; the one notable regression is HumanEval (-35.3%):
| Metric | Mid-training | SFT | RL | Relative Change (SFT→RL) |
|---|---|---|---|---|
| GSM8K | 0.1137 | 0.1327 | 0.2305 | +73.6% |
| ARC-Easy | 0.6961 | 0.7210 | 0.7130 | -1.1% |
| ARC-Challenge | 0.5367 | 0.5418 | 0.5375 | -0.8% |
| MMLU | 0.4229 | 0.4304 | 0.4256 | -1.1% |
| HumanEval | 0.1098 | 0.1037 | 0.0671 | -35.3% |
| SpellingBee | - | - | 0.9922 | N/A |
| ChatCORE | 0.4045 | 0.4157 | 0.4208 | +1.2% |
## Training Details
### RL Configuration (GRPO)
- Run: d34_rl
- Source: SFT checkpoint
- dtype: bfloat16
- device_batch_size: 4
- examples_per_step: 16
- num_samples: 16
- max_new_tokens: 256
- temperature: 1.0
- top_k: 50
- Learning Rates:
  - unembedding_lr: 0.0040
  - embedding_lr: 0.2000
  - matrix_lr: 0.0200
- weight_decay: 0.0
- num_epochs: 1
- Total Steps: 467
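For readability, the settings above can be collected in one place. The following is a hypothetical sketch as a Python dataclass; field names mirror the list above and are not claimed to match nanochat's actual config object.

```python
from dataclasses import dataclass

@dataclass
class GRPORunConfig:
    # Rollout / sampling
    device_batch_size: int = 4
    examples_per_step: int = 16   # GSM8K problems per optimizer step
    num_samples: int = 16         # completions sampled per problem (the "group")
    max_new_tokens: int = 256
    temperature: float = 1.0
    top_k: int = 50
    # Optimization, with separate learning rates per parameter group
    unembedding_lr: float = 0.0040
    embedding_lr: float = 0.2000
    matrix_lr: float = 0.0200
    weight_decay: float = 0.0
    num_epochs: int = 1
```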
### Training Metrics (Final)
- Pass@1: 0.2300
- Pass@2: 0.2750
- Pass@3: 0.3275
- Pass@4: 0.3675
- Average Reward: ~0.28
- Average Sequence Length: ~178 tokens
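Pass@k grows with k because 16 samples are drawn per problem and a problem counts as solved if any of k attempts is correct. A common way to compute this is the unbiased estimator from Chen et al. (2021); here is a minimal sketch assuming n samples per problem with c correct (whether this exact estimator produced the numbers above is an assumption).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 16 samples per problem, pass@4 for a problem with 4 correct samples:
print(pass_at_k(n=16, c=4, k=4))  # ~0.73
```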
## Repository Structure
```
├── tokenizer/
│   ├── tokenizer.pkl            # Tokenizer
│   └── token_bytes.pt           # Token byte mappings
├── chatrl_checkpoints/d34/      # RL checkpoints
│   ├── model_000466.pt          # Final model weights
│   └── meta_000466.json         # Training metadata
├── report/                      # Evaluation reports
│   └── report.md
└── logs/                        # Training logs
```
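To inspect the artifacts locally, something like the following should work. This is a minimal sketch that assumes the `.pt` file holds a plain state dict (or a dict wrapping one) and that the pickled tokenizer's class is importable from the nanochat codebase; neither assumption has been verified against this repo.

```python
import pickle
import torch

# Tokenizer: a pickled Python object; unpickling requires the defining
# class (from the nanochat codebase) to be importable.
with open("tokenizer/tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)

# Final RL checkpoint; map_location avoids needing a GPU just to inspect it.
# ASSUMPTION: the file contains a state dict or a dict wrapping one.
checkpoint = torch.load("chatrl_checkpoints/d34/model_000466.pt",
                        map_location="cpu")
print(type(checkpoint))
```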
## WandB Training Run
## Related Models
- Base: karpathy/nanochat-d34 - Pre-trained base model
- SFT: pankajmathur/nanochat-d34-finetuned - Mid-training + SFT checkpoint
## License
MIT License (same as nanochat)
## Acknowledgments
- Andrej Karpathy for the nanochat framework and pre-trained base model
- The nanochat community
## Citation

```bibtex
@misc{nanochat,
  author = {Andrej Karpathy},
  title = {nanochat: The best ChatGPT that $100 can buy},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/karpathy/nanochat}
}
```
