Collections including paper arxiv:2502.13595

- GAIA: a benchmark for General AI Assistants
  Paper • 2311.12983 • Published • 226
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  Paper • 2311.16502 • Published • 35
- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 27
- RULER: What's the Real Context Size of Your Long-Context Language Models?
  Paper • 2404.06654 • Published • 39

- Self-Boosting Large Language Models with Synthetic Preference Data
  Paper • 2410.06961 • Published • 17
- Qwen2.5 Technical Report
  Paper • 2412.15115 • Published • 373
- SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
  Paper • 2412.13649 • Published • 20
- NeoBERT: A Next-Generation BERT
  Paper • 2502.19587 • Published • 40

- CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery
  Paper • 2406.08587 • Published • 16
- Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
  Paper • 2406.09170 • Published • 28
- AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
  Paper • 2407.18901 • Published • 35
- Benchmarking Agentic Workflow Generation
  Paper • 2410.07869 • Published • 28

- MMTEB: Massive Multilingual Text Embedding Benchmark
  Paper • 2502.13595 • Published • 38
- MTEB: Massive Text Embedding Benchmark
  Paper • 2210.07316 • Published • 6
- The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding
  Paper • 2406.02396 • Published
- Extending the Massive Text Embedding Benchmark to French
  Paper • 2405.20468 • Published • 2

- Offline Reinforcement Learning for LLM Multi-Step Reasoning
  Paper • 2412.16145 • Published • 39
- SPLADE-v3: New baselines for SPLADE
  Paper • 2403.06789 • Published • 4
- MMTEB: Massive Multilingual Text Embedding Benchmark
  Paper • 2502.13595 • Published • 38
- Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling
  Paper • 2409.14683 • Published • 12

- LoRA+: Efficient Low Rank Adaptation of Large Models
  Paper • 2402.12354 • Published • 6
- The FinBen: An Holistic Financial Benchmark for Large Language Models
  Paper • 2402.12659 • Published • 22
- TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
  Paper • 2402.13249 • Published • 13
- TrustLLM: Trustworthiness in Large Language Models
  Paper • 2401.05561 • Published • 70