VerIF Collection: RL-trained models and datasets for instruction following
This model is trained from DeepSeek-R1-Distill-Qwen-7B on 131k critique examples (IF-Verifier-Data). It is used to verify soft constraints in instruction following. Deploying IF-Verifier-7B requires only a single H800 GPU, with an average reward computation time of about 120 seconds per batch, which can be further reduced with multiple GPUs.
A policy model trained with rewards from this verifier achieves performance comparable to one trained with QwQ-32B as the verifier.
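As a rough illustration of how the verifier can be served for reward computation, the sketch below loads it with vLLM on a single GPU and turns its verdict on one soft constraint into a binary reward. The Hub model ID, the prompt wording, and the YES/NO parsing convention are assumptions for illustration only; see the GitHub repo for the exact verification prompt and reward extraction used in VerIF.

```python
# Minimal sketch: serve the verifier with vLLM and score one response against a
# soft constraint. Model ID, prompt template, and reward parsing are assumptions.
from vllm import LLM, SamplingParams

MODEL_ID = "THU-KEG/IF-Verifier-7B"  # hypothetical Hub ID; check the collection for the real one

# A single H800 is sufficient; raise tensor_parallel_size to use multiple GPUs.
llm = LLM(model=MODEL_ID, tensor_parallel_size=1)
params = SamplingParams(temperature=0.0, max_tokens=1024)

# Hypothetical verification prompt: ask the model to reason about whether the
# response satisfies the soft constraint, then emit a final YES/NO verdict.
constraint = "The reply must keep a formal, academic tone."
response = "Hey folks, here's the lowdown on transformers..."
prompt = (
    "You are a strict verifier for instruction following.\n"
    f"Constraint: {constraint}\n"
    f"Response: {response}\n"
    "Think step by step, then answer on the last line with 'YES' if the "
    "constraint is satisfied, otherwise 'NO'."
)

output = llm.generate([prompt], params)[0].outputs[0].text
# Turn the final verdict into a binary reward for RL training (assumed convention).
verdict = output.strip().split()[-1].upper() if output.strip() else "NO"
reward = 1.0 if "YES" in verdict else 0.0
print(verdict, reward)
```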
Please refer to our paper and our GitHub repo (https://github.com/THU-KEG/VerIF) for more details.
If you find this model helpful, please kindly cite us:
@misc{peng2025verif,
title={VerIF: Verification Engineering for Reinforcement Learning in Instruction Following},
author={Hao Peng and Yunjia Qi and Xiaozhi Wang and Bin Xu and Lei Hou and Juanzi Li},
year={2025},
eprint={2506.09942},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.09942},
}