When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity Paper • 2509.20293 • Published Sep 24 • 7 • 3
MARVIS: Modality Adaptive Reasoning over VISualizations Paper • 2507.01544 • Published Jul 2 • 13 • 1
SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification Paper • 2410.05057 • Published Oct 7, 2024 • 7 • 2
Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking Paper • 2409.15268 • Published Sep 23, 2024 • 13 • 2