nanochat-d34-rl-all-ckpts

This is an RL-trained version of nanochat-d34-finetuned, fine-tuned with GRPO (Group Relative Policy Optimization) on GSM8K math problems. This repository includes checkpoints saved at every training step.

Model Description

Key Achievement: GSM8K +73.6% Improvement

The RL training substantially boosted math reasoning while leaving most general benchmarks roughly unchanged (HumanEval is the notable regression):

| Metric | MID | SFT | RL | Change (SFT→RL) |
|---|---|---|---|---|
| GSM8K | 0.1137 | 0.1327 | 0.2305 | +73.6% |
| ARC-Easy | 0.6961 | 0.7210 | 0.7130 | -1.1% |
| ARC-Challenge | 0.5367 | 0.5418 | 0.5375 | -0.8% |
| MMLU | 0.4229 | 0.4304 | 0.4256 | -1.1% |
| HumanEval | 0.1098 | 0.1037 | 0.0671 | -35.3% |
| SpellingBee | - | - | 0.9922 | N/A |
| ChatCORE | 0.4045 | 0.4157 | 0.4208 | +1.2% |
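
MID and SFT refer to the upstream nanochat midtraining and supervised fine-tuning checkpoints; the RL column is this model.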

Training Details

RL Configuration (GRPO)

  • Run: d34_rl
  • Source: SFT checkpoint
  • dtype: bfloat16
  • device_batch_size: 4
  • examples_per_step: 16
  • num_samples: 16
  • max_new_tokens: 256
  • temperature: 1.0
  • top_k: 50
  • Learning Rates:
    • unembedding_lr: 0.0040
    • embedding_lr: 0.2000
    • matrix_lr: 0.0200
  • weight_decay: 0.0
  • num_epochs: 1
  • Total Steps: 467
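
As a rough illustration of what GRPO does with the num_samples = 16 rollouts configured above, here is a minimal sketch of the group-relative advantage computation. This is illustrative only; the function and variable names are hypothetical, not nanochat's actual code:

```python
import torch

# GRPO samples a group of completions per prompt and normalizes rewards
# within that group, so each sample is scored relative to its siblings.
def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, num_samples) per-completion rewards."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One GSM8K prompt, 16 sampled answers, binary correctness reward:
rewards = torch.tensor([[1., 0., 0., 1., 0., 0., 0., 1.,
                         0., 0., 1., 0., 0., 0., 1., 0.]])
adv = group_relative_advantages(rewards)
# Correct samples get a positive advantage, incorrect ones a negative one;
# the policy-gradient loss then weights each sample's token log-probs by it.
```

Normalizing within the group means no learned value function is needed: each sample's baseline is simply its siblings' average reward.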

Training Metrics (Final)

  • Pass@1: 0.2300
  • Pass@2: 0.2750
  • Pass@3: 0.3275
  • Pass@4: 0.3675
  • Average Reward: ~0.28
  • Average Sequence Length: ~178 tokens
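
Pass@k is the fraction of problems solved by at least one of k sampled completions. Assuming these numbers are estimated from the 16 rollouts per problem, a standard unbiased estimator (popularized by the HumanEval paper) looks like the following sketch; this may not be the exact evaluation code used here:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples per problem, c of them correct."""
    if n - c < k:  # every size-k draw necessarily contains a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 4 of them correct.
print(pass_at_k(16, 4, 1))  # 0.25
print(pass_at_k(16, 4, 4))  # ~0.73
```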

Repository Structure

```
├── tokenizer/
│   ├── tokenizer.pkl          # Tokenizer
│   └── token_bytes.pt         # Token byte mappings
├── chatrl_checkpoints/d34/    # RL checkpoint
│   ├── model_000466.pt        # Final model weights
│   └── meta_000466.json       # Training metadata
├── report/                    # Evaluation reports
│   └── report.md
└── logs/                      # Training logs
```
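
A hedged sketch of loading the final checkpoint from the layout above. The nanochat repo's own utilities are the authoritative loading path; model construction is elided since it depends on the repo's GPT class:

```python
import json
import pickle

import torch

# Paths follow the repository structure above.
with open("tokenizer/tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)  # nanochat tokenizer object

# The checkpoint may be a raw state dict or a dict wrapping one; inspect it first.
state_dict = torch.load("chatrl_checkpoints/d34/model_000466.pt", map_location="cpu")

with open("chatrl_checkpoints/d34/meta_000466.json") as f:
    meta = json.load(f)  # step count, model config, training hyperparameters

# Build the nanochat GPT model from the repo using meta's config, then:
# model.load_state_dict(state_dict)
```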

WandB Training Run

Full Report

See report/report.md in this repository for the full evaluation report.

Related Models

License

MIT License (same as nanochat)

Acknowledgments

  • Andrej Karpathy for the nanochat framework and pre-trained base model
  • The nanochat community

Citation

```bibtex
@misc{nanochat,
  author = {Andrej Karpathy},
  title = {nanochat: The best ChatGPT that $100 can buy},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/karpathy/nanochat}
}
```