Model Description
This model is a ReMax fine-tune of Qwen/Qwen3-0.6B-Base.
It was aligned on the HelpSteer2 dataset to improve helpfulness and instruction-following. Unlike standard PPO, ReMax eliminates the need for a value model (critic) and instead uses a greedy-decoding baseline to reduce variance, making it highly efficient for alignment. The goal of the fine-tuning was to improve helpfulness/harmlessness behavior as measured on HelpSteer2, while also enabling controlled model-diffing experiments as part of the AIPlans research workflow.
Developed by: AIPlans
Funded by: AIPlans
Shared by: AIPlans
Model type: Causal decoder-only Transformer (LLM)
Languages: English
Intended Use: Research on model diffing, preference fine-tuning, evaluation of lightweight LLM behavior changes
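Since model diffing is part of the intended use, the snippet below is a minimal sketch of a per-token log-probability comparison between the base model and this checkpoint. The prompt, the helper function, and the loading choices are illustrative assumptions, not part of the AIPlans workflow.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model-diffing sketch: compare per-token log-probabilities of the
# base model and the ReMax checkpoint on the same text.
base_id = "Qwen/Qwen3-0.6B-Base"
tuned_id = "sorakritt/qwen3-0.6b-ReMax-hs2"  # repo id from the Usage section below

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained(tuned_id, torch_dtype=torch.bfloat16)

text = "User: How do I make a cake?\n\nAssistant: Start by preheating the oven."
ids = tokenizer(text, return_tensors="pt").input_ids

def token_logprobs(model, input_ids):
    """Log-probability the model assigns to each next token in input_ids."""
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1].float()
    logp = torch.log_softmax(logits, dim=-1)
    return logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

# Positive values: the ReMax model assigns higher probability to that token than the base model.
diff = token_logprobs(tuned, ids) - token_logprobs(base, ids)
for tok, d in zip(tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist()), diff[0].tolist()):
    print(f"{tok!r}: {d:+.3f}")
```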
Evaluation
Below is a comparison between the base model and this ReMax-trained version.
| Task | Metric | Base Model | ReMax Model | Change |
|---|---|---|---|---|
| arc_challenge | acc_norm | 0.3840 | 0.3916 | +0.0077 |
| arc_easy | acc_norm | 0.5812 | 0.6709 | +0.0896 |
| hellaswag | acc_norm | 0.5378 | 0.5536 | +0.0157 |
| truthfulqa_mc2 | acc | 0.4583 | 0.4578 | -0.0005 |
| winogrande | acc | 0.5951 | 0.5833 | -0.0118 |
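The acc/acc_norm metric names are consistent with EleutherAI's lm-evaluation-harness, but the card does not state the harness version or few-shot settings. The snippet below is therefore only a sketch of how comparable numbers could be obtained; the batch size is illustrative and few-shot counts fall back to the task defaults.

```python
# Sketch of re-running the benchmarks with lm-evaluation-harness (pip install lm-eval).
# The choice of harness and the batch size are assumptions; the card does not
# document how the table above was produced.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sorakritt/qwen3-0.6b-ReMax-hs2,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```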
Training Details
- Method: ReMax (Derived from Li et al., 2023)
- Base Model: Qwen/Qwen3-0.6B-Base
- RM Model Used: https://huggingface.co/AIPlans/Qwen3-0.6B-RM-hs2
- SFT Model Used: https://huggingface.co/AIPlans/Qwen3-0.6B-SFT-hs2
- Precision: bfloat16 (Training), bfloat16 (Final Weights)
- Optimizer: AdamW
- Learning Rate: 5e-7
- Epochs: 3
- Batch Size: 16 (effective global batch size: 32)
- Hardware: NVIDIA A100 (80GB)
Algorithm Highlights
- No Value Model: Reduces memory usage by ~50% compared to PPO.
- Greedy Baseline: Uses the model's own greedy generation as the baseline for advantage calculation.
- Proxy KL: Uses the policy-vs-reference log-probability difference as a KL proxy for stability (see the sketch after this list).
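The training code itself is not included in this card; the following is a minimal sketch of a single ReMax-style update as described above (sampled rollout, greedy baseline instead of a critic, log-probability KL proxy). The reward-model interface, generation lengths, and the kl_coef value are assumptions made for illustration, not the actual training script.

```python
import torch

# Illustrative ReMax update (assumed interfaces; not the actual training code).
# `policy` and `ref_policy` are causal LMs sharing `tokenizer`;
# `reward_fn(prompt, response)` returns a scalar score from the reward model.

def sequence_logprob(model, tokenizer, prompt, response, device):
    """Summed log-probability of the response tokens given the prompt.
    Assumes the prompt tokenization is a prefix of the full tokenization."""
    full = tokenizer(prompt + response, return_tensors="pt").input_ids.to(device)
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    logits = model(full).logits[:, :-1].float()
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, full[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logp[:, prompt_len - 1:].sum()

def remax_step(policy, ref_policy, reward_fn, tokenizer, prompt, optimizer, kl_coef=0.05):
    device = next(policy.parameters()).device
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    # Sampled response (the action) and greedy response (the baseline).
    sampled_ids = policy.generate(ids, do_sample=True, max_new_tokens=128)
    greedy_ids = policy.generate(ids, do_sample=False, max_new_tokens=128)
    sampled = tokenizer.decode(sampled_ids[0, ids.shape[1]:], skip_special_tokens=True)
    greedy = tokenizer.decode(greedy_ids[0, ids.shape[1]:], skip_special_tokens=True)

    # Greedy baseline removes the need for a learned value model (critic).
    advantage = reward_fn(prompt, sampled) - reward_fn(prompt, greedy)

    # Proxy KL: log-prob difference between the policy and the frozen reference model.
    logp = sequence_logprob(policy, tokenizer, prompt, sampled, device)
    with torch.no_grad():
        ref_logp = sequence_logprob(ref_policy, tokenizer, prompt, sampled, device)
    kl_proxy = (logp - ref_logp).detach()

    # REINFORCE-style loss with the baseline-corrected, KL-penalised reward.
    loss = -(advantage - kl_coef * kl_proxy) * logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, the AdamW optimizer at the 5e-7 learning rate listed above would be passed in as `optimizer`, with the SFT checkpoint as the starting policy and the listed reward model behind `reward_fn`.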
Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sorakritt/qwen3-0.6b-ReMax-hs2"

# Load the ReMax checkpoint in bfloat16 and place it on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Simple User/Assistant prompt format, as in the card's example.
prompt = "User: How do I make a cake?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Model Card Author
Premanand Jena - AIPlans Research Intern. Contact: premjena07@gmail.com
Citation
ReMax paper:
@article{li2023remax,
  title   = {ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models},
  author  = {Li, Ziniu and Xu, Tian and Zhang, Yushun and Yu, Yang and Sun, Ruoyu and Luo, Zhi-Quan},
  journal = {arXiv preprint arXiv:2310.10505},
  year    = {2023},
}
TRL:
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
year = 2020,
journal = {GitHub repository},
publisher = {GitHub},
howpublished = {\url{https://github.com/huggingface/trl}}
}