Model Description
This model is a ReMax fine-tune of Qwen/Qwen3-0.6B-Base.
It was aligned on the HelpSteer2 dataset to improve helpfulness and instruction-following. Unlike standard PPO, ReMax eliminates the need for a value model (critic) and instead uses a greedy-decoding baseline to reduce variance, making it highly efficient for alignment. The goal of the fine-tuning was to improve helpfulness/harmlessness behavior as measured on HelpSteer2, while also enabling controlled model-diffing experiments as part of the AIPlans research workflow.
Developed by: AIPlans
Funded by: AIPlans
Shared by: AIPlans
Model type: Causal decoder-only Transformer (LLM)
Languages: English
Intended Use: Research on model diffing, preference fine-tuning, evaluation of lightweight LLM behavior changes
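Since model diffing is part of the intended use, the snippet below is a minimal sketch of a per-token log-probability comparison between the base model and this checkpoint. The prompt, the helper function, and the loading choices are illustrative assumptions, not part of the AIPlans workflow.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model-diffing sketch: compare per-token log-probabilities of the
# base model and the ReMax checkpoint on the same text.
base_id = "Qwen/Qwen3-0.6B-Base"
tuned_id = "sorakritt/qwen3-0.6b-ReMax-hs2"  # repo id from the Usage section below

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained(tuned_id, torch_dtype=torch.bfloat16)

text = "User: How do I make a cake?\n\nAssistant: Start by preheating the oven."
ids = tokenizer(text, return_tensors="pt").input_ids

def token_logprobs(model, input_ids):
    """Log-probability the model assigns to each next token in input_ids."""
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1].float()
    logp = torch.log_softmax(logits, dim=-1)
    return logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

# Positive values: the ReMax model assigns higher probability to that token than the base model.
diff = token_logprobs(tuned, ids) - token_logprobs(base, ids)
for tok, d in zip(tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist()), diff[0].tolist()):
    print(f"{tok!r}: {d:+.3f}")
```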
Evaluation
Below is a comparison between the base model and this ReMax-trained version.
| Task | Metric | Base Model | ReMax Model | Change |
|---|---|---|---|---|
| arc_challenge | acc_norm | 0.3840 | 0.3916 | +0.0077 |
| arc_easy | acc_norm | 0.5812 | 0.6709 | +0.0896 |
| hellaswag | acc_norm | 0.5378 | 0.5536 | +0.0157 |
| truthfulqa_mc2 | acc | 0.4583 | 0.4578 | -0.0005 |
| winogrande | acc | 0.5951 | 0.5833 | -0.0118 |
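The acc/acc_norm metric names are consistent with EleutherAI's lm-evaluation-harness, but the card does not state the harness version or few-shot settings. The snippet below is therefore only a sketch of how comparable numbers could be obtained; the batch size is illustrative and few-shot counts fall back to the task defaults.

```python
# Sketch of re-running the benchmarks with lm-evaluation-harness (pip install lm-eval).
# The choice of harness and the batch size are assumptions; the card does not
# document how the table above was produced.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sorakritt/qwen3-0.6b-ReMax-hs2,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```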
Training Details
- Method: ReMax (Derived from Li et al., 2023)
- Base Model: Qwen/Qwen3-0.6B-Base
- RM Model Used: https://huggingface.co/AIPlans/Qwen3-0.6B-RM-hs2
- SFT Model Used: https://huggingface.co/AIPlans/Qwen3-0.6B-SFT-hs2
- Precision: bfloat16 (Training), bfloat16 (Final Weights)
- Optimizer: AdamW
- Learning Rate: 5e-7
- Epochs: 3
- Batch Size: 16 (effective global batch size: 32)
- Hardware: NVIDIA A100 (80GB)
Algorithm Highlights
- No Value Model: Reduces memory usage by ~50% compared to PPO.
- Greedy Baseline: Uses the model's own greedy generation as the baseline for advantage calculation.
- Proxy KL: Uses the policy-vs-reference log-probability difference as a KL proxy for stability (see the sketch after this list).
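The training code itself is not included in this card; the following is a minimal sketch of a single ReMax-style update as described above (sampled rollout, greedy baseline instead of a critic, log-probability KL proxy). The reward-model interface, generation lengths, and the kl_coef value are assumptions made for illustration, not the actual training script.

```python
import torch

# Illustrative ReMax update (assumed interfaces; not the actual training code).
# `policy` and `ref_policy` are causal LMs sharing `tokenizer`;
# `reward_fn(prompt, response)` returns a scalar score from the reward model.

def sequence_logprob(model, tokenizer, prompt, response, device):
    """Summed log-probability of the response tokens given the prompt.
    Assumes the prompt tokenization is a prefix of the full tokenization."""
    full = tokenizer(prompt + response, return_tensors="pt").input_ids.to(device)
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    logits = model(full).logits[:, :-1].float()
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, full[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logp[:, prompt_len - 1:].sum()

def remax_step(policy, ref_policy, reward_fn, tokenizer, prompt, optimizer, kl_coef=0.05):
    device = next(policy.parameters()).device
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    # Sampled response (the action) and greedy response (the baseline).
    sampled_ids = policy.generate(ids, do_sample=True, max_new_tokens=128)
    greedy_ids = policy.generate(ids, do_sample=False, max_new_tokens=128)
    sampled = tokenizer.decode(sampled_ids[0, ids.shape[1]:], skip_special_tokens=True)
    greedy = tokenizer.decode(greedy_ids[0, ids.shape[1]:], skip_special_tokens=True)

    # Greedy baseline removes the need for a learned value model (critic).
    advantage = reward_fn(prompt, sampled) - reward_fn(prompt, greedy)

    # Proxy KL: log-prob difference between the policy and the frozen reference model.
    logp = sequence_logprob(policy, tokenizer, prompt, sampled, device)
    with torch.no_grad():
        ref_logp = sequence_logprob(ref_policy, tokenizer, prompt, sampled, device)
    kl_proxy = (logp - ref_logp).detach()

    # REINFORCE-style loss with the baseline-corrected, KL-penalised reward.
    loss = -(advantage - kl_coef * kl_proxy) * logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, the AdamW optimizer at the 5e-7 learning rate listed above would be passed in as `optimizer`, with the SFT checkpoint as the starting policy and the listed reward model behind `reward_fn`.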
Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sorakritt/qwen3-0.6b-ReMax-hs2"

# Load the ReMax checkpoint in bfloat16 and place it on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Simple User/Assistant prompt format, as in the card's example.
prompt = "User: How do I make a cake?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Model Card Author
Premanand Jena - AIPlans Research Intern. Contact: premjena07@gmail.com
Citation
ReMax paper:
@article{li2023remax,
  title   = {ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models},
  author  = {Li, Ziniu and Xu, Tian and Zhang, Yushun and Yu, Yang and Sun, Ruoyu and Luo, Zhi-Quan},
  journal = {arXiv preprint arXiv:2310.10505},
  year    = {2023},
}
TRL:
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
year = 2020,
journal = {GitHub repository},
publisher = {GitHub},
howpublished = {\url{https://github.com/huggingface/trl}}
}