Reward Model Environment

An environment that uses an external reward model hosted via vLLM to train LLMs. This environment communicates with a reward model API, formats conversations using chat templates, batches requests for efficiency, and includes retry logic for robustness.

Features

  • External Reward Model Integration: Connects to reward models hosted via vLLM's /classify endpoint
  • Automatic Model Discovery: Fetches the reward model name from /v1/models
  • Batched Requests: Sends all rollouts in a single batch request for efficiency
  • Retry Logic: Automatically retries failed requests with exponential backoff
  • Chat Template Support: Properly formats conversations using tokenizer chat templates
  • Sanity Checks: Logs statistics and warnings for reward values to ensure proper scaling
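The retry behavior above can be sketched with a hypothetical `with_retries` helper (exponential backoff between attempts; illustrative only, not the environment's actual implementation):

```python
import time

def with_retries(fn, max_retries=3, retry_delay=1.0):
    """Call fn, retrying failed attempts with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt: retry_delay, 2x, 4x, ...
            time.sleep(retry_delay * (2 ** attempt))
```

The `max_retries` and `retry_delay` names mirror the configuration options documented below.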

Installation

uv run vf-install reward-model-env

Usage

Basic Example

import verifiers as vf

# Load the environment
vf_env = vf.load_environment(
    "reward-model-env",
    dataset_name="gsm8k",  # HF dataset with 'prompt' or 'question' column
    dataset_config="main",  # Optional: dataset config name (required for some datasets)
    reward_model_url="http://localhost:8002",  # URL where your reward model is hosted
    tokenizer_path="./tokenizer.json",  # Optional: path to tokenizer for chat template
    num_train_examples=100,  # Optional: limit training examples
)

# Evaluate with an OpenAI-compatible model
from openai import AsyncOpenAI

results = await vf_env.evaluate(
    client=AsyncOpenAI(base_url="http://localhost:8000/v1"),
    model="your-model",
    num_examples=10,
    rollouts_per_example=1,
)

See example.py for a complete working example.

Environment Variables

Set REWARD_MODEL_URL to avoid passing it as an argument:

export REWARD_MODEL_URL="http://localhost:8002"
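One way such a fallback might be resolved (a hypothetical `resolve_reward_model_url` helper; the environment's actual precedence may differ):

```python
import os

def resolve_reward_model_url(explicit_url=None):
    """Explicit argument wins, then REWARD_MODEL_URL, then a local default."""
    return explicit_url or os.environ.get(
        "REWARD_MODEL_URL", "http://localhost:8002"
    )
```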

Reward Model Setup

This environment expects a reward model hosted via vLLM with the classification API enabled. Example setup:

# Start vLLM with a reward model
vllm serve Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 \
  --port 8002 \
  --enable-classification

API Format

The environment expects the following API endpoints:

/v1/models (GET)

Returns available models:

{
  "data": [
    {"id": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"}
  ]
}
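The automatic model discovery step amounts to pulling the first model id out of this response. A minimal sketch (hypothetical `discover_model_name` helper operating on the parsed JSON body):

```python
def discover_model_name(models_response):
    """Extract the first model id from a parsed /v1/models response body."""
    models = models_response.get("data", [])
    if not models:
        raise ValueError("no models reported by /v1/models")
    return models[0]["id"]
```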

/classify (POST)

Request:

{
  "model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2",
  "input": [
    "<s>[INST]question[/INST]answer</s>"
  ]
}

Response:

{
  "data": [
    {
      "index": 0,
      "label": "LABEL_0",
      "probs": [0.85],
      "num_classes": 1
    }
  ]
}

The probs[0] value is used as the reward.
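Parsing this response into per-rollout rewards looks roughly like the following (a hypothetical `extract_rewards` helper; it assumes the response shape shown above and checks that results align with the batched inputs):

```python
def extract_rewards(classify_response, num_inputs):
    """Map a parsed /classify response to a list of rewards, one per input."""
    data = classify_response.get("data", [])
    if len(data) != num_inputs:
        raise ValueError(f"expected {num_inputs} results, got {len(data)}")
    rewards = [None] * num_inputs
    for item in data:
        # "index" ties each result back to its position in the input batch
        rewards[item["index"]] = item["probs"][0]
    return rewards
```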

Chat Template Formatting

The environment properly formats multi-turn conversations for the reward model:

# Input conversation
[
  {"role": "user", "content": "lets do python coding"},
  {"role": "assistant", "content": "Sure! How'd you like to get started?"}
]

# Formatted output (using Llama-style template)
"<s>[INST]lets do python coding[/INST]Sure! How'd you like to get started?</s>"

If you provide a tokenizer_path, it will use the tokenizer's native chat template. Otherwise, it falls back to a simple Llama-style format.
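The fallback formatting can be sketched as follows (a simplified, hypothetical `format_llama_style` helper; it handles only user and assistant turns and omits system-prompt handling):

```python
def format_llama_style(messages):
    """Simple Llama-style fallback when no tokenizer chat template is set."""
    out = "<s>"
    for msg in messages:
        if msg["role"] == "user":
            out += f"[INST]{msg['content']}[/INST]"
        elif msg["role"] == "assistant":
            out += msg["content"]
    return out + "</s>"
```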

Configuration Options

  • dataset_name (str): Hugging Face dataset name
  • reward_model_url (str): Base URL for the reward model API
  • dataset_config (str | None): Dataset config name (e.g., "main" for gsm8k, optional)
  • tokenizer_path (str | None): Path to tokenizer.json for chat template formatting
  • system_prompt (str): System prompt for the environment (default: "You are a helpful assistant.")
  • num_train_examples (int): Number of training examples (-1 for all)
  • num_eval_examples (int): Number of eval examples (-1 for all)
  • max_retries (int): Maximum retry attempts for API calls (default: 3)
  • retry_delay (float): Base delay between retries in seconds (default: 1.0)
  • timeout (float): Request timeout in seconds (default: 120.0)

Sanity Checks

The environment includes several sanity checks:

  1. Reward Range Logging: Logs min, max, mean, and median rewards for each batch
  2. Small Value Warnings: Warns if rewards are extremely small (< 1e-10) to help detect truncation issues
  3. Response Validation: Ensures the API response structure is correct and matches the input
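The first two checks amount to computing batch statistics and flagging suspiciously small values. A sketch (hypothetical `reward_stats` helper; the environment logs rather than returns these):

```python
import statistics

def reward_stats(rewards, small_threshold=1e-10):
    """Batch reward statistics plus a flag for near-zero rewards."""
    return {
        "min": min(rewards),
        "max": max(rewards),
        "mean": statistics.mean(rewards),
        "median": statistics.median(rewards),
        # Extremely small rewards may indicate input truncation upstream
        "tiny_reward_warning": any(r < small_threshold for r in rewards),
    }
```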

Training Example

Use with vf-rl for reinforcement learning:

# configs/rl/reward_model.toml
model = "Qwen/Qwen3-4B-Instruct-2507"

[env]
id = "reward-model-env"
reward_model_url = "http://localhost:8002"
dataset_name = "your-dataset"
tokenizer_path = "./tokenizer.json"

[inference]
gpus = 1

[trainer]
gpus = 1
use_lora = true
learning_rate = 1e-5
max_steps = 100

Then launch training:

uv run vf-rl @ configs/rl/reward_model.toml

Troubleshooting

Connection Issues

  • Ensure your reward model is running and accessible at the specified URL
  • Check firewall settings if connecting to a remote server
  • Verify the /v1/models endpoint returns valid data

Reward Scaling

  • Check the logged reward statistics to ensure values are in the expected range
  • If rewards are too small, they might not provide sufficient training signal
  • Consider normalizing or scaling rewards based on your use case

Chat Template Issues

  • If using a tokenizer, ensure it has a chat template defined
  • The fallback simple formatting works for most Llama-style models
  • Check the logged sample conversation to verify formatting is correct