TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them Paper • 2509.21117 • Published Sep 25 • 29
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist Paper • 2407.08733 • Published Jul 11, 2024 • 23