Rabbits Leaderboard 💊: Visualize and analyze language model robustness to drug name synonyms (Running, 20)
Open LLM Leaderboard 🏆: Track, rank and evaluate open LLMs and chatbots (Running on CPU Upgrade, 13.7k)
MMLU-Pro Leaderboard 🥇: More advanced and challenging multi-task evaluation (Running on CPU Upgrade, 235)