# Reward Model Environment
An environment that scores rollouts with an external reward model hosted via vLLM, for use in LLM training. It communicates with the reward model API, formats conversations using chat templates, batches requests for efficiency, and includes retry logic for robustness.
## Features

- **External Reward Model Integration**: Connects to reward models hosted via vLLM's `/classify` endpoint
- **Automatic Model Discovery**: Fetches the reward model name from `/v1/models`
- **Batched Requests**: Sends all rollouts in a single batch request for efficiency
- **Retry Logic**: Automatically retries failed requests with exponential backoff
- **Chat Template Support**: Formats conversations using tokenizer chat templates
- **Sanity Checks**: Logs statistics and warnings for reward values to ensure proper scaling
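The retry behavior can be sketched as follows. This is an illustrative helper, not the environment's actual API: the name `with_retries` and its defaults are assumptions, but the backoff shape (base delay doubled per attempt) matches the description above.

```python
import asyncio
import random

# Illustrative sketch of retry with exponential backoff: each failed attempt
# waits base_delay * 2**attempt (plus a little jitter) before retrying.
async def with_retries(call, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return await call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error
            await asyncio.sleep(base_delay * 2**attempt + random.random() * 0.1)

# Usage: a flaky call that succeeds on the third attempt.
attempts = []

async def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = asyncio.run(with_retries(flaky, base_delay=0.01))  # -> "ok"
```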
## Installation

```bash
uv run vf-install reward-model-env
```
## Usage

### Basic Example

```python
import verifiers as vf

# Load the environment
vf_env = vf.load_environment(
    "reward-model-env",
    dataset_name="gsm8k",  # HF dataset with a 'prompt' or 'question' column
    dataset_config="main",  # Optional: dataset config name (required for some datasets)
    reward_model_url="http://localhost:8002",  # URL where your reward model is hosted
    tokenizer_path="./tokenizer.json",  # Optional: path to tokenizer for chat template
    num_train_examples=100,  # Optional: limit training examples
)

# Evaluate with an OpenAI-compatible model
from openai import AsyncOpenAI

results = await vf_env.evaluate(
    client=AsyncOpenAI(base_url="http://localhost:8000/v1"),
    model="your-model",
    num_examples=10,
    rollouts_per_example=1,
)
```

See `example.py` for a complete working example.
## Environment Variables

Set `REWARD_MODEL_URL` to avoid passing it as an argument:

```bash
export REWARD_MODEL_URL="http://localhost:8002"
```
## Reward Model Setup

This environment expects a reward model hosted via vLLM with the classification API enabled. Example setup:

```bash
# Start vLLM with a reward model
vllm serve Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 \
    --port 8002 \
    --enable-classification
```
## API Format

The environment expects the following API endpoints:

### `/v1/models` (GET)

Returns available models:

```json
{
  "data": [
    {"id": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"}
  ]
}
```

### `/classify` (POST)

Request:

```json
{
  "model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2",
  "input": [
    "<s>[INST]question[/INST]answer</s>"
  ]
}
```

Response:

```json
{
  "data": [
    {
      "index": 0,
      "label": "LABEL_0",
      "probs": [0.85],
      "num_classes": 1
    }
  ]
}
```

The `probs[0]` value is used as the reward.
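The batched request/response handling can be sketched as follows. The helper names (`build_classify_payload`, `extract_rewards`) are illustrative, not the environment's actual functions; only the payload and response shapes come from the API format above.

```python
def build_classify_payload(model: str, formatted_texts: list[str]) -> dict:
    # One batched request: every formatted rollout goes in the "input" list.
    return {"model": model, "input": formatted_texts}

def extract_rewards(response: dict) -> list[float]:
    # probs[0] of each item is the scalar reward; sort by index so rewards
    # line up with the inputs even if the server reorders items.
    items = sorted(response["data"], key=lambda d: d["index"])
    return [item["probs"][0] for item in items]

payload = build_classify_payload(
    "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2",
    ["<s>[INST]question[/INST]answer</s>"],
)

# A response in the documented shape:
response = {
    "data": [{"index": 0, "label": "LABEL_0", "probs": [0.85], "num_classes": 1}]
}
rewards = extract_rewards(response)  # [0.85]
```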
## Chat Template Formatting

The environment formats multi-turn conversations for the reward model:

```python
# Input conversation
[
    {"role": "user", "content": "lets do python coding"},
    {"role": "assistant", "content": "Sure! How'd you like to get started?"}
]

# Formatted output (using a Llama-style template)
"<s>[INST]lets do python coding[/INST]Sure! How'd you like to get started?</s>"
```

If you provide a `tokenizer_path`, the environment uses the tokenizer's native chat template. Otherwise, it falls back to a simple Llama-style format.
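The fallback can be sketched as a small formatter. This is an assumption about the fallback's behavior inferred from the example output above; the environment's actual implementation may handle system prompts or whitespace differently.

```python
def simple_llama_format(messages: list[dict]) -> str:
    # Wrap user turns in [INST]...[/INST] and append assistant turns verbatim,
    # enclosing the whole conversation in <s>...</s>.
    out = "<s>"
    for msg in messages:
        if msg["role"] == "user":
            out += f"[INST]{msg['content']}[/INST]"
        elif msg["role"] == "assistant":
            out += msg["content"]
    return out + "</s>"

conversation = [
    {"role": "user", "content": "lets do python coding"},
    {"role": "assistant", "content": "Sure! How'd you like to get started?"},
]
formatted = simple_llama_format(conversation)
```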
## Configuration Options

- `dataset_name` (str): Hugging Face dataset name
- `reward_model_url` (str): Base URL for the reward model API
- `dataset_config` (str | None): Dataset config name (e.g., "main" for gsm8k; optional)
- `tokenizer_path` (str | None): Path to `tokenizer.json` for chat template formatting
- `system_prompt` (str): System prompt for the environment (default: "You are a helpful assistant.")
- `num_train_examples` (int): Number of training examples (-1 for all)
- `num_eval_examples` (int): Number of eval examples (-1 for all)
- `max_retries` (int): Maximum retry attempts for API calls (default: 3)
- `retry_delay` (float): Base delay between retries in seconds (default: 1.0)
- `timeout` (float): Request timeout in seconds (default: 120.0)
## Sanity Checks

The environment includes several sanity checks:

- **Reward Range Logging**: Logs the min, max, mean, and median reward for each batch
- **Small Value Warnings**: Warns if rewards are extremely small (< 1e-10) to help detect truncation issues
- **Response Validation**: Ensures the API response structure is correct and matches the input
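The first two checks can be sketched as follows; `check_rewards` and its return shape are illustrative, not the environment's actual code, though the 1e-10 threshold matches the description above.

```python
import statistics

def check_rewards(rewards: list[float]) -> dict:
    # Per-batch summary statistics, as logged by the reward-range check.
    stats = {
        "min": min(rewards),
        "max": max(rewards),
        "mean": statistics.mean(rewards),
        "median": statistics.median(rewards),
    }
    # Small-value check: flag batches whose rewards are all below 1e-10,
    # which can indicate truncated or mis-scaled model outputs.
    stats["warn_small"] = all(abs(r) < 1e-10 for r in rewards)
    return stats

stats = check_rewards([0.12, 0.85, 0.43, 0.67])
```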
## Training Example

Use with `vf-rl` for reinforcement learning:

```toml
# configs/rl/reward_model.toml
model = "Qwen/Qwen3-4B-Instruct-2507"

[env]
id = "reward-model-env"
reward_model_url = "http://localhost:8002"
dataset_name = "your-dataset"
tokenizer_path = "./tokenizer.json"

[inference]
gpus = 1

[trainer]
gpus = 1
use_lora = true
learning_rate = 1e-5
max_steps = 100
```

```bash
uv run vf-rl @ configs/rl/reward_model.toml
```
## Troubleshooting

### Connection Issues

- Ensure your reward model is running and accessible at the specified URL
- Check firewall settings if connecting to a remote server
- Verify that the `/v1/models` endpoint returns valid data
### Reward Scaling

- Check the logged reward statistics to ensure values are in the expected range
- If rewards are too small, they may not provide a sufficient training signal
- Consider normalizing or scaling rewards for your use case
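For example, a per-batch z-score normalization (not part of this environment, just one common option) can restore training signal when raw rewards cluster near zero:

```python
import statistics

def normalize_rewards(rewards: list[float]) -> list[float]:
    # Shift rewards to mean 0 and (population) standard deviation 1.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # A constant batch carries no ranking information.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Tiny raw rewards still produce a usable relative signal after normalization.
normalized = normalize_rewards([1e-11, 2e-11, 3e-11])
```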
### Chat Template Issues

- If using a tokenizer, ensure it has a chat template defined
- The simple fallback formatting works for most Llama-style models
- Check the logged sample conversation to verify formatting is correct