Tiny Reasoning Language Model (trlm-135)


Table of Contents

  1. Model Summary
  2. Post-Training Pipeline
  3. How to use
  4. Training
  5. Evaluation
  6. Limitations
  7. Acknowledgements
  8. License

Model Summary

The Tiny Reasoning Language Model (trlm-135) is a 135M-parameter research prototype designed to study how small models can learn step-by-step reasoning. It was built on top of SmolLM2-135M-Instruct and fine-tuned through a three-stage pipeline: supervised fine-tuning on non-reasoning data, supervised fine-tuning on reasoning traces, and DPO-based preference alignment.

The code for the entire pipeline can be found in the accompanying repository.


Post-Training Pipeline

(Figure: the three-stage post-training pipeline: Stage 1 SFT on non-reasoning data, Stage 2 SFT on reasoning traces, Stage 3 DPO alignment.)

How to use

```bash
pip install -U transformers accelerate
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shekswess/trlm-135m"
device = "cuda"  # or "cpu"

# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Example prompt
prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
    {"role": "user", "content": prompt},
]

# Apply the chat template and build model inputs
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For reasoning-heavy tasks, sample with temperature=0.6 and top_p=0.95. Note that these settings only take effect when sampling is enabled, as in the sketch below.
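
A minimal sketch of such a call, reusing model, tokenizer, and inputs from the snippet above (do_sample=True is required, since generate() otherwise decodes greedily and ignores temperature and top_p):

```python
# Sampled decoding for reasoning-heavy prompts
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,   # enable sampling so temperature/top_p apply
    temperature=0.6,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```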


Training

Model

  • Architecture: Decoder-only transformer (SmolLM2 backbone, which is in fact a Llama 3-based design); its exact dimensions can be read from the released config, as sketched below.
  • Parameters: ~135M.
  • Precision: mixed-precision (bfloat16) during training.
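
A quick way to inspect the backbone (a generic transformers sketch; the printed fields are standard Llama-style config attributes):

```python
from transformers import AutoConfig

# Fetch the released config and print the core architecture dimensions
config = AutoConfig.from_pretrained("Shekswess/trlm-135m")
print(config.model_type)            # architecture family
print(config.num_hidden_layers)     # transformer depth
print(config.hidden_size)           # model width
print(config.num_attention_heads)   # attention heads per layer
```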

Software & Hardware

  • Training Frameworks: PyTorch (ROCm), Hugging Face Transformers & TRL.
  • Hardware: AMD MI300X (192GB VRAM, 224GB RAM).

Special thanks to @HotAisle
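
On ROCm builds of PyTorch, the MI300X is exposed through the usual CUDA device API, so a basic environment check looks like the following (a generic sketch, not specific to this training run):

```python
import torch

# ROCm builds report AMD GPUs through the CUDA API
print(torch.cuda.is_available())       # True on a working ROCm install
print(torch.cuda.get_device_name(0))   # e.g. an AMD Instinct MI300X
print(torch.version.hip)               # HIP version string (None on CUDA builds)
```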

Training Stages

  1. Stage 1 – SFT (non-reasoning)
    • ~58k samples, everyday conversations & instruction following.
  2. Stage 2 – SFT (reasoning)
    • ~78k samples with <think> segments.
  3. Stage 3 – DPO (alignment)
    • ~50k preference pairs (chosen vs. rejected reasoning traces); a condensed sketch of stages 2 and 3 follows this list.
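
Stages 2 and 3 map directly onto TRL's SFTTrainer and DPOTrainer. The sketch below is illustrative only: the dataset files, hyperparameters, and output directories are placeholders, not the actual training configuration.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer, SFTConfig, SFTTrainer

base = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Stage 2: SFT on reasoning traces. Assistant turns are assumed to embed
# the chain of thought in <think>...</think> before the final answer.
sft_data = load_dataset("json", data_files="sft_reasoning.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="trlm-sft", bf16=True),
    train_dataset=sft_data,
    processing_class=tokenizer,
)
sft_trainer.train()

# Stage 3: DPO on preference pairs with "prompt", "chosen", and
# "rejected" columns (chosen vs. rejected reasoning traces).
dpo_data = load_dataset("json", data_files="dpo_pairs.jsonl", split="train")
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    args=DPOConfig(output_dir="trlm-dpo", bf16=True),
    train_dataset=dpo_data,
    processing_class=tokenizer,
)
dpo_trainer.train()
```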

Evaluation

Evaluation was done with lm-eval-harness:

| Benchmark | Tiny Reasoning Language Model (trlm-135M) | SmolLM2-135M-Instruct | Improvement |
| --- | --- | --- | --- |
| ARC Challenge | 40.61 (avg) | 37.3 (avg) | +3.31 |
| BBH | 36.80 (3-shot) | 28.2 (3-shot) | +8.6 |
| BoolQ | 62.17 | – | N/A |
| GSM8K | 2.59 (5-shot) | 1.4 (5-shot) | +1.19 |
| IFEval | 35.49 (avg) | 29.9 (avg) | +5.59 |
| MMLU | 34.95 | 29.3 | +5.65 |
| PIQA | 64.91 | 66.3 | –1.39 |
| HellaSwag | – | 40.9 | N/A |
| MT-Bench | – | 19.8 | N/A |
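
For reference, a single benchmark can be re-scored with a command along these lines (the task name, dtype, and batch size are illustrative; the shot counts per benchmark are listed in the table):

```bash
# Example: GSM8K 5-shot, matching the setting reported above
lm_eval --model hf \
  --model_args pretrained=Shekswess/trlm-135m,dtype=bfloat16 \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 8
```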

Limitations

  • Not production-ready: hallucinations and logical errors are frequent.
  • Small size: limited general knowledge and reasoning depth.
  • English-only: multilingual capabilities not explored.

Acknowledgements

  • @HotAisle for providing the compute resources to train all three stages on an awesome AMD MI300X setup.
  • @mkurman88 for ideas, feedback and code samples.
  • HuggingFaceTB team for SmolLM2-135M-Instruct model and the Smoltalk2 dataset collection.
  • @scottgeng00 for the OLMo-3-Preference-Mix-Deltas dataset.
  • @eliebakouchi for help with the tokenization.

License

Apache 2.0

