Usage
This model scores each reasoning step in a solution, outputting one reward per step.
Babelscape/Qwen2.5-Math-7B-PRM800k-PDDL-r is a Process Reward Model (PRM) based on Qwen2.5-Math-7B-Instruct. It is trained with process-supervision data from PRM800K and with the planning-based supervision introduced in PDDL2PRM.
PDDL2PRM is the dataset introduced in:
Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
Raffaele Pisano and Roberto Navigli, ACL 2026
Project page & paper: https://babelscape.github.io/prm-meets-planning/
arXiv: https://arxiv.org/abs/2604.17957
The paper proposes using symbolic planning problems written in Planning Domain Definition Language (PDDL) to generate precise step-level rewards for reasoning trajectories. In PDDL, actions, states, preconditions, effects, and goals are explicitly defined, so intermediate reasoning steps can be evaluated automatically.
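To illustrate why explicit preconditions and effects make step-level labels automatic, here is a minimal Python sketch of a STRIPS-style validity check. It is not the PDDL2PRM pipeline: the `Action` class, the `label_steps` function, and the toy blocks-world facts are hypothetical names chosen for illustration. Labelling every step after the first failure as 0 is an assumed convention for this sketch, not necessarily the one used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical STRIPS-style action; facts are plain strings."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset


def label_steps(initial_state, plan):
    """Return one 0/1 label per action in `plan`.

    A step is valid (1) if all of its preconditions hold in the current
    state. After the first invalid step, the remaining steps are
    labelled 0 (an assumed convention for this sketch).
    """
    state = set(initial_state)
    labels = []
    failed = False
    for action in plan:
        if failed or not action.preconditions <= state:
            failed = True
            labels.append(0)
            continue
        # Apply the action: remove deleted facts, then add new ones.
        state = (state - action.del_effects) | action.add_effects
        labels.append(1)
    return labels


# Toy blocks-world instance (illustrative facts, not taken from the dataset).
pick_a = Action(
    "pick-up a",
    frozenset({"clear a", "on-table a", "hand-empty"}),
    frozenset({"holding a"}),
    frozenset({"clear a", "on-table a", "hand-empty"}),
)
stack_a_b = Action(
    "stack a b",
    frozenset({"holding a", "clear b"}),
    frozenset({"on a b", "clear a", "hand-empty"}),
    frozenset({"holding a", "clear b"}),
)
pick_b = Action(
    "pick-up b",
    frozenset({"clear b", "on-table b", "hand-empty"}),
    frozenset({"holding b"}),
    frozenset({"clear b", "on-table b", "hand-empty"}),
)

init = {"clear a", "clear b", "on-table a", "on-table b", "hand-empty"}
labels = label_steps(init, [pick_a, stack_a_b, pick_b])
print(labels)
```

In this toy plan the third step is rejected because block b is no longer clear once a has been stacked on it, so the label follows directly from the symbolic state and needs no human annotation.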
Example
import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "rpisano/Qwen2.5-Math-7B-PRM800k-PDDL-r"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

def build_prompt(problem, steps):
    # Each step is followed by the marker "ки" on its own line.
    steps_text = "\n".join([f"Step {i+1}: {step}\nки" for i, step in enumerate(steps)])
    return f"Problem: {problem}\nSteps:\n{steps_text}"

problem = "If x + 3 = 10, find x."
steps = [
    "Subtract 3 from both sides: x = 10 - 3.",
    "So x = 7."
]

prompt = build_prompt(problem, steps)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One scalar per input token.
pred_scalar = outputs["pred_scalar"]

# Locate the marker token that closes each step and read the score there.
marker_id = tokenizer.encode("ки", add_special_tokens=False)[0]
marker_positions = (inputs["input_ids"][0] == marker_id).nonzero(as_tuple=True)[0]
step_scores = torch.sigmoid(pred_scalar[0, marker_positions]).cpu().tolist()
print("Step scores:", step_scores)

# Index of the first step scoring below 0.5, or -1 if every step passes.
first_bad = next((i for i, score in enumerate(step_scores) if score < 0.5), -1)
print("First failing step index:", first_bad)
Notes
- The marker "ки" must appear after every reasoning step.
- pred_scalar contains one scalar per token, so only values at marker positions should be used as step scores.
- A threshold such as 0.5 on the sigmoid-transformed score can be used to flag potentially incorrect steps.
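The notes above can be condensed into a small post-processing helper. The sketch below uses plain Python lists in place of tensors so that it runs without loading the model; `step_scores_from_tokens` and `first_failing_step` are hypothetical helper names, and the token IDs and raw values in the usage lines are made up for illustration.

```python
import math


def step_scores_from_tokens(input_ids, pred_scalar, marker_id):
    """Keep only the values at marker positions and apply a sigmoid."""
    if len(input_ids) != len(pred_scalar):
        raise ValueError("expected one scalar per token")
    return [
        1.0 / (1.0 + math.exp(-value))
        for token_id, value in zip(input_ids, pred_scalar)
        if token_id == marker_id
    ]


def first_failing_step(step_scores, threshold=0.5):
    """Return the index of the first step scoring below `threshold`, or -1."""
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return -1


# Made-up token IDs and raw outputs; 99 stands in for the marker token ID.
input_ids = [11, 12, 99, 13, 14, 99]
pred_scalar = [0.3, -1.2, 2.0, 0.1, 0.7, -2.0]

scores = step_scores_from_tokens(input_ids, pred_scalar, marker_id=99)
print(scores)
print(first_failing_step(scores))
```

Values at non-marker positions are discarded before thresholding, so a low raw value on an ordinary token never counts as a failed step.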
Citation
If you use this model or the PDDL2PRM dataset in your work, please cite:
@inproceedings{pisano2026prmplanning,
title={Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards},
author={Pisano, Raffaele and Navigli, Roberto},
booktitle={Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2026},
note={Accepted}
}