The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs Paper • 2501.10970 • Published Jan 19, 2025 • 1
LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals Paper • 2601.10700 • Published Jan 15 • 18
Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality Paper • 2602.14080 • Published Feb 15 • 21
A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks Paper • 2605.28556 • Published 7 days ago • 55
A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks Paper • 2605.28556 • Published 7 days ago • 55
Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling Paper • 2605.12411 • Published 22 days ago • 49
MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image Paper • 2605.10616 • Published 23 days ago • 140
Hallucinations Undermine Trust; Metacognition is a Way Forward Paper • 2605.01428 • Published May 2 • 24
Efficient Agent Evaluation via Diversity-Guided User Simulation Paper • 2604.21480 • Published Apr 23 • 15
Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness Paper • 2604.12373 • Published Apr 14 • 9
Alignment Makes Language Models Normative, Not Descriptive Paper • 2603.17218 • Published Mar 17 • 46
Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs Paper • 2603.09906 • Published Mar 10 • 75
Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality Paper • 2602.14080 • Published Feb 15 • 21 • 3
Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality Paper • 2602.14080 • Published Feb 15 • 21
Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality Paper • 2602.14080 • Published Feb 15 • 21
STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts Paper • 2602.14265 • Published Feb 15 • 21
LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals Paper • 2601.10700 • Published Jan 15 • 18
LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals Paper • 2601.10700 • Published Jan 15 • 18