FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents Paper • 2504.13128 • Published Apr 17 • 5
Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses Paper • 2504.20006 • Published Apr 28
Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval Paper • 2505.16967 • Published May 22 • 23
LLM Hallucination Leaderboard • Generate interactive React app data visualizations • 149