Models That Know How Evaluations Are Designed Score Safer
Abstract
Fine-tuning models on synthetic documents describing evaluation traits improves safety benchmark performance by enabling implicit recognition of evaluation-like contexts, independent of memorization or explicit awareness.
The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation for this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices (for instance, scientific articles or social media posts about AI benchmarking) may implicitly learn to recognize and respond to evaluation-like contexts. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating the resulting fine-tuned model on six safety benchmarks, we find that it is significantly safer than both the base model and a control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness and is therefore challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge.
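The restricted analysis described in the abstract (comparing safety rates only over responses that do not verbalize evaluation awareness) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `Response` record, its field names, the model labels, and the toy rows are all assumptions made for the example, and the numbers carry no relation to the paper's results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    # All fields are hypothetical; the paper's actual data schema may differ.
    model: str        # e.g. "base", "control", "meta_knowledge" (assumed labels)
    safe: bool        # whether a grader judged the response safe
    verbalized: bool  # whether the response explicitly mentions being evaluated


def safe_rate(responses: List[Response], model: str,
              exclude_verbalized: bool = False) -> Optional[float]:
    """Fraction of safe responses for one model.

    With exclude_verbalized=True, responses that explicitly verbalize
    evaluation awareness are dropped before computing the rate.
    Returns None if no responses remain after filtering.
    """
    rows = [
        r for r in responses
        if r.model == model and not (exclude_verbalized and r.verbalized)
    ]
    if not rows:
        return None
    return sum(r.safe for r in rows) / len(rows)


# Placeholder rows, only to exercise the function (not real data).
responses = [
    Response("base", safe=False, verbalized=False),
    Response("base", safe=True, verbalized=False),
    Response("meta_knowledge", safe=True, verbalized=True),
    Response("meta_knowledge", safe=True, verbalized=False),
    Response("meta_knowledge", safe=False, verbalized=False),
]

overall = safe_rate(responses, "meta_knowledge")
restricted = safe_rate(responses, "meta_knowledge", exclude_verbalized=True)
```

Comparing `overall` against `restricted` for each model is one way to check whether a safety gap survives once explicitly evaluation-aware responses are removed; a real analysis would also need a significance test across benchmarks.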
Community
Parametric knowledge of evaluation traits implicitly shifts LLMs toward safer behavior on safety benchmarks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness (2026)
- Evaluation Awareness in Language Models Has Limited Effect on Behaviour (2026)
- Internalizing Safety Understanding in Large Reasoning Models via Verification (2026)
- Characterizing the Consistency of the Emergent Misalignment Persona (2026)
- Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost (2026)
- How Sensitive Are Safety Benchmarks to Judge Configuration Choices? (2026)
- Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models (2026)
Made an audio walkthrough of this paper for anyone who wants to skim it on the go:
https://researchpod.app/episode/3a7cef01-8f70-4f1a-b50c-33c50da646cd
Generated automatically by ResearchPod — happy to take feedback from the authors.
Thank you! Really cool!
Get this paper in your agent:
hf papers read 2605.28591
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper 4
compass-group-tue/nemotron-6-traits
Datasets citing this paper 1
compass-group-tue/sdf_evaluation_traits
Spaces citing this paper 0