---
language:
- en
license: gemma
library_name: transformers
pipeline_tag: text-generation
base_model: google/gemma-3-270m
tags:
- gemma3
- trl
- grpo
- rlhf
- sft
- math
- reasoning
- chain-of-thought
- experimental
- colab
- kaggle
---
# Gemma-3-270M GRPO (math + CoT)
⚠️ Experimental training run. This is a personal experiment and not production-ready. The current checkpoint is from ~1200 GRPO steps; accuracy is unstable and may be low. Prompt format and weights may change. Please evaluate carefully before any use.
Small reasoning-tuned variant of **Google’s Gemma-3-270M**.
Two-stage recipe: **SFT** on math prompts with hidden `<think>…</think>` reasoning, then **GRPO** to reinforce *correct final answers* while *discouraging overly long hidden reasoning*.
**Note:** there is **no separate answer tag** in this project. The final answer is emitted as `\boxed{...}` after the closing `</think>`.
> The model can emit `<think>…</think>` blocks. Examples below **strip** them by default.
---
## ✨ What’s inside
- **Base**: `google/gemma-3-270m`
- **Objective**:
  - **SFT**: learn the prompt format + produce a boxed final answer.
  - **GRPO**: reward `= 1.0` if the boxed answer matches ground truth (numeric or `sympy`-equivalent), else `0.0`, **minus** a small penalty proportional to the number of tokens inside `<think>…</think>`, with KL regularization to the SFT reference.
---
## 🧠 Prompt & output format
Training/eval wrapper (no separate answer tag):
```text
<think>
…(internal scratch work)…
</think>
\boxed{FINAL_ANSWER}
```
- SFT builder resembled `format_sft_example(question, reasoning, final_answer)`; a sketch is shown below.
- RL prompts contain only the wrapped question (no reasoning, no answer) and expect the model to write its reasoning followed by `\boxed{...}`.
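For reference, a minimal sketch of what such a builder could look like. The `<question>` wrapper tag and the exact spacing are assumptions, not the verbatim training code:
```python
def format_sft_example(question: str, reasoning: str, final_answer: str) -> str:
    # One SFT training example: wrapped question, hidden reasoning, boxed answer.
    # The <question> tag name is an assumed placeholder; <think> matches the
    # format described above.
    return (
        f"<question>\n{question}\n</question>\n"
        f"<think>\n{reasoning}\n</think>\n"
        f"\\boxed{{{final_answer}}}"
    )
```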
---
## 🚀 Quickstart
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, re

MODEL_ID = "nirav-madhani/gemma3-270m-grpo-math"  # change if different

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
    attn_implementation="eager",
).to(device).eval()

BOX_RE = re.compile(r"\\boxed\{([^{}]+)\}")

def generate(question, max_new_tokens=160, temperature=0.2, top_p=0.95, return_boxed=True, show_think=False):
    # Prompt wrapper used during training; the <question> tag name is an assumption,
    # adjust it to match your SFT data. The model continues with <think>…</think>
    # reasoning followed by \boxed{...}.
    prompt = f"<question>\n{question}\n</question>\n<think>\n"
    inputs = tok(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=temperature > 0,
            temperature=temperature,
            top_p=top_p,
            eos_token_id=tok.eos_token_id,
            pad_token_id=tok.pad_token_id,
            use_cache=True,
        )
    text = tok.decode(out[0], skip_special_tokens=True)
    if not show_think:
        # Drop the hidden reasoning block before returning.
        text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.S)
    if return_boxed:
        m = BOX_RE.search(text)
        return m.group(1).strip() if m else text.strip()
    return text

print(generate("If 3x + 5 = 17, what is x?"))
```
---
## 🏗️ Training recipe (summary)
**Stage 1 — SFT**
- `transformers==4.55.x`, `trl==0.21.0`
- Tokenizer: Gemma-3 fast; `pad_token = eos_token`
- `Trainer` + `DataCollatorForLanguageModeling`
- `max_seq_length = prompt_len + completion_len`; small per-device batch with grad accumulation; BF16/FP16 on Ampere, else FP32.
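A minimal sketch of that setup. The hyperparameters are placeholders, `tok`/`model` are assumed to be the Gemma-3 tokenizer and base model, and `tokenized_ds` is assumed to be a pre-tokenized dataset of `format_sft_example` outputs:
```python
import torch
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Causal-LM collator: labels mirror the input ids (no masked-LM objective).
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=False)

args = TrainingArguments(
    output_dir="sft-out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
    logging_steps=10,
    save_steps=200,
    save_total_limit=3,
)

trainer = Trainer(
    model=model,                 # base google/gemma-3-270m
    args=args,
    train_dataset=tokenized_ds,  # pre-tokenized SFT examples (assumed)
    data_collator=collator,
)
trainer.train()
```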
**Stage 2 — RL (GRPO)**
- TRL **GRPO** (`trl==0.21.0`)
- Policy & reference initialized from **SFT**
- **Reward** (sketched after this list):
  - `+1.0` if `\boxed{...}` equals ground truth (float tolerance or `sympy` equivalence)
  - `−λ * (#tokens inside <think>…</think>)`
- Knobs (fit ~15 GB VRAM; tune per GPU):
  - `num_generations (K)`: 2–8
  - `per_device_train_batch_size`: 1–2
  - `gradient_accumulation_steps`: 4–8
    (*ensure* `batch * accum * world_size` is **divisible by K**)
  - `max_prompt_length`: ~160–256
  - `max_completion_length`: ~128–192
  - `beta` (KL): ~0.02
  - `attn_implementation="eager"`; enable `use_cache=True` if you have headroom
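A condensed sketch of the reward function and trainer wiring under these knobs. `LAMBDA`, the dataset column names (`prompt`, `answer`), and the output paths are assumptions; `tok` is the tokenizer from the quickstart:
```python
import re
import sympy
from trl import GRPOConfig, GRPOTrainer

LAMBDA = 1e-3  # per-token penalty on hidden reasoning (placeholder value)

def answers_match(pred: str, gt: str, tol: float = 1e-6) -> bool:
    # Numeric comparison first, then a sympy-equivalence fallback.
    try:
        return abs(float(pred) - float(gt)) <= tol
    except (ValueError, TypeError):
        pass
    try:
        return sympy.simplify(sympy.sympify(pred) - sympy.sympify(gt)) == 0
    except Exception:
        return pred.strip() == gt.strip()

def boxed_reward(completions, answer, **kwargs):
    # TRL passes completions plus extra dataset columns (here: `answer`).
    rewards = []
    for completion, gt in zip(completions, answer):
        m = re.search(r"\\boxed\{([^{}]+)\}", completion)
        correct = 1.0 if (m and answers_match(m.group(1), str(gt))) else 0.0
        think = re.search(r"<think>(.*?)</think>", completion, flags=re.S)
        n_think = len(tok(think.group(1)).input_ids) if think else 0
        rewards.append(correct - LAMBDA * n_think)
    return rewards

config = GRPOConfig(
    output_dir="grpo-out",
    num_generations=4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    max_prompt_length=192,
    max_completion_length=160,
    beta=0.02,
)
trainer = GRPOTrainer(
    model="sft-out",            # policy (and reference) initialized from the SFT checkpoint
    reward_funcs=boxed_reward,
    args=config,
    train_dataset=rl_dataset,   # dataset with `prompt` and `answer` columns (assumed)
)
trainer.train()
```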
**Checkpointing**
- Checkpoints saved every *N* steps; keep last 3; persisted to Google Drive or `/kaggle/working`.
- Inference loader grabs **newest `checkpoint-XXXX/`**, else RL root → SFT → base.
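A small sketch of that fallback logic (directory names are placeholders):
```python
from pathlib import Path

def latest_checkpoint(root: str) -> str:
    # Return the newest checkpoint-XXXX/ under `root`, or `root` itself if none exist.
    ckpts = sorted(Path(root).glob("checkpoint-*"), key=lambda p: int(p.name.split("-")[-1]))
    return str(ckpts[-1]) if ckpts else root

def resolve_model_path(rl_dir="grpo-out", sft_dir="sft-out", base="google/gemma-3-270m") -> str:
    # Prefer the newest RL checkpoint, then the SFT output, then the base model.
    for candidate in (rl_dir, sft_dir):
        if Path(candidate).exists():
            return latest_checkpoint(candidate)
    return base
```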
---
## 📊 Evaluation
*Early-stage.* Evaluate on your math split or GSM-style test by extracting `\boxed{…}` and checking numeric or `sympy`-equivalence. Track:
- reward mean/std, exact-match of final answers,
- KL vs. reference,
- output length and “think” token counts.
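A minimal exact-match loop, reusing the `generate` helper from the quickstart and the `answers_match` check sketched above (both names come from this card's snippets, not a packaged API):
```python
def exact_match(pairs):
    # `pairs` is a list of (question, ground_truth) tuples.
    hits = 0
    for question, gt in pairs:
        pred = generate(question)  # returns the boxed answer by default
        hits += bool(answers_match(str(pred), str(gt)))
    return hits / max(len(pairs), 1)

print(f"exact match: {exact_match([('If 3x + 5 = 17, what is x?', '4')]):.2%}")
```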
---
## ⚖️ License & usage
- **Base license**: Gemma models are under the **Gemma license**. This derivative remains subject to those terms.
- Repo metadata sets `license: gemma`. Review Gemma’s terms before commercial use/redistribution.
---
## 🔒 Limitations & risks
- 270M params is **very small**; expect brittleness outside narrow math tasks.
- Hidden reasoning can be wrong; we **hide** it by default.
- No built-in safety filtering.
---
## 🧩 Repro notes
- Colab/Kaggle friendly; use `attn_implementation="eager"` to avoid FA mismatches.
- GRPO progress bar’s “Training Loss” can be `0.0` — monitor **reward/KL/length**.
- Env tips:
  - set `TRANSFORMERS_NO_TORCHVISION=1`,
  - ensure compatible `numpy`/`scikit-learn` on Kaggle if `transformers.generation` pulls sklearn.
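If you apply the first tweak in a notebook, set the variable before importing `transformers`:
```python
import os
os.environ["TRANSFORMERS_NO_TORCHVISION"] = "1"  # must be set before the import below

import transformers
```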
---
## 🙌 Acknowledgements
- Base weights: **Google** `google/gemma-3-270m`
- RL training: **TRL** (`trl==0.21.0`)
---
## 📣 Citation
```bibtex
@software{nirav_gemma3_270m_grpo_math_2025,
  title  = {Gemma-3-270M GRPO (math + CoT)},
  author = {Nirav Madhani},
  year   = {2025},
  url    = {https://huggingface.co/nirav-madhani/gemma3-270m-grpo-math}
}
```