Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design
Abstract
Empirical assessments reveal significant fluctuations in the benchmark evaluation results of Deepseek-R1-Distill models, calling the reliability of claimed performance improvements into question and motivating a more rigorous evaluation paradigm.
Reasoning models represented by the Deepseek-R1-Distill series have been widely adopted by the open-source community due to their strong performance in mathematics, science, programming, and other domains. However, our study reveals that their benchmark evaluation results are subject to significant fluctuations caused by various factors; subtle differences in evaluation conditions can lead to substantial variations in results. Similar phenomena are observed in other open-source reasoning models fine-tuned from the Deepseek-R1-Distill series, as well as in the QwQ-32B model, making their claimed performance improvements difficult to reproduce reliably. We therefore advocate for the establishment of a more rigorous paradigm for model performance evaluation and present our empirical assessments of the Deepseek-R1-Distill series models.
Community
What are the main takeaways?
This study reveals how minor variations in evaluation design (dataset versions, seed initialization, instruction placement, option ordering, and tensor-parallelism settings) can cause significant fluctuations in benchmark scores for reasoning-focused LLMs such as the Deepseek-R1-Distill series.
Key Findings:
- Evaluation Conditions Matter: Small changes can lead to performance swings of several percentage points, making model comparisons unreliable.
- Seed Initialization & N-Sampling: Results are highly sensitive to random seeds; using more inference samples (larger N) improves stability.
- Dataset Versions: Differences in image handling or formatting between versions (e.g., of the AIME datasets) shift results by up to 3.9 percentage points.
- Option & Answer Bias: In multiple-choice benchmarks such as GPQA Diamond, answer position and option order can reduce accuracy by over 5 percentage points (see the sketch after this list).
- Instruction Placement: Minor impact overall, but it still affects reproducibility depending on how the placement matches the model's training format.
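To make the option-ordering effect concrete, below is a minimal Python sketch of one common mitigation: presenting each multiple-choice item under several seeded option permutations and scoring against the remapped gold answer. The function name, example question, and letter scheme are illustrative assumptions, not part of the paper's evaluation code.

```python
import random

def shuffle_options(question: str, options: list[str], answer_idx: int, seed: int):
    """Present one multiple-choice item with its options in a seeded random order
    and return the prompt text plus the new index of the correct answer."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    letters = "ABCDEFGH"
    lines = [question] + [f"{letters[i]}. {options[j]}" for i, j in enumerate(order)]
    return "\n".join(lines), order.index(answer_idx)

# Score the same item under several orderings so that a positional preference
# cannot systematically inflate or deflate accuracy.
question = "Which particle mediates the electromagnetic force?"
options = ["photon", "gluon", "W boson", "graviton"]
for seed in range(3):
    prompt, gold_idx = shuffle_options(question, options, answer_idx=0, seed=seed)
    print(prompt)
    print("correct option:", "ABCDEFGH"[gold_idx], "\n")
```

Averaging accuracy over such permutations (or at least reporting the permutation scheme) keeps a model's position preference from masquerading as a capability gain.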
Issues Identified:
- Many open-source models overstate performance gains due to favorable evaluation setups, not actual improvements.
- Lack of transparency and standardization leads to unreproducible results.
Recommendations:
- Use fixed seeds, clear documentation, and confidence intervals.
- Report average performance with statistical stability, not just peak scores (a minimal sketch of such reporting follows this list).
- Adopt standardized evaluation frameworks and disclose all settings.
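As a minimal sketch of the last two recommendations, the snippet below aggregates repeated, seeded runs into a mean score with a 95% confidence interval and prints the seeds alongside the result. The scores and seed values are illustrative placeholders, not figures from the paper.

```python
import statistics

def summarize_runs(accuracies: list[float]) -> str:
    """Summarize repeated evaluation runs as mean accuracy with a 95% confidence
    interval (normal approximation), rather than quoting the single best run."""
    n = len(accuracies)
    mean = statistics.mean(accuracies)
    stderr = statistics.stdev(accuracies) / n ** 0.5
    return f"{mean:.1f} ± {1.96 * stderr:.1f} over {n} runs"

# Illustrative pass@1 scores from repeated runs, keyed by the (documented) seed.
# These numbers are placeholders, not results reported in the paper.
scores_by_seed = {0: 70.0, 1: 66.7, 2: 73.3, 3: 70.0, 42: 63.3, 1234: 70.0}
print("seeds:", sorted(scores_by_seed))
print("accuracy:", summarize_runs(list(scores_by_seed.values())))
```

Reporting the interval alongside the seeds and full evaluation settings lets readers judge whether a claimed gain exceeds the run-to-run noise documented in the paper.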