---
license: apache-2.0
tags:
- reward-model
- rlhf
- principle-following
- qwen
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-8B
language:
- en
- zh
---
¹Peking University, ²WeChat AI, ³William & Mary, ⁴Westlake University
§Work done during Zhuohao's internship at Pattern Recognition Center, WeChat AI, Tencent Inc.; †Corresponding author.