Abstract
Evaluating agent-based systems reveals a collaboration gap: models that perform well solo degrade when paired, suggesting the need for collaboration-aware evaluation and training strategies.
The trajectory of AI development suggests that we will increasingly rely on agent-based systems composed of independently developed agents with different information, privileges, and tools. The success of these systems will critically depend on effective collaboration among these heterogeneous agents, even under partial observability. Despite intense interest, few empirical studies have evaluated such agent-agent collaboration at scale. We propose a collaborative maze-solving benchmark that (i) isolates collaborative capabilities, (ii) modulates problem complexity, (iii) enables scalable automated grading, and (iv) imposes no output-format constraints, preserving ecological plausibility. Using this framework, we evaluate 32 leading open- and closed-source models in solo, homogeneous, and heterogeneous pairings. Our results reveal a "collaboration gap": models that perform well solo often degrade substantially when required to collaborate. Collaboration can break down dramatically; for instance, small distilled models that solve mazes well alone may fail almost completely in certain pairings. We find that starting with the stronger agent often improves outcomes, motivating a "relay inference" approach where the stronger agent leads before handing off to the weaker one, closing much of the gap. Our findings argue for (1) collaboration-aware evaluation, (2) training strategies developed to enhance collaborative capabilities, and (3) interaction design that reliably elicits agents' latent skills, guidance that applies to AI-AI and human-AI collaboration.
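The "relay inference" idea from the abstract, where the stronger agent leads before handing off to the weaker one, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes an agent can be modeled as a function from the transcript so far to its next message, that two seats alternate turns, and that the handoff happens after a fixed number of turns. The names `relay_agent`, `run_dialogue`, and `handoff_after`, as well as the stub agents, are hypothetical.

```python
from typing import Callable, List

# Assumption: an agent maps the transcript so far to its next message.
Agent = Callable[[List[str]], str]


def relay_agent(strong: Agent, weak: Agent, handoff_after: int) -> Agent:
    """Build one seat that uses `strong` for its first `handoff_after`
    turns and `weak` for every turn after that (hypothetical helper)."""
    turns_taken = 0

    def speak(transcript: List[str]) -> str:
        nonlocal turns_taken
        chosen = strong if turns_taken < handoff_after else weak
        turns_taken += 1
        return chosen(transcript)

    return speak


def run_dialogue(first: Agent, second: Agent, max_turns: int) -> List[str]:
    """Alternate between two seats, starting with `first`."""
    transcript: List[str] = []
    for turn in range(max_turns):
        speaker = first if turn % 2 == 0 else second
        transcript.append(speaker(transcript))
    return transcript


# Toy stand-ins for real model calls; each reports who spoke and when.
def strong_stub(transcript: List[str]) -> str:
    return f"strong:{len(transcript)}"


def weak_stub(transcript: List[str]) -> str:
    return f"weak:{len(transcript)}"


def partner_stub(transcript: List[str]) -> str:
    return f"partner:{len(transcript)}"


relay = relay_agent(strong_stub, weak_stub, handoff_after=2)
transcript = run_dialogue(relay, partner_stub, max_turns=6)
print(transcript)
```

In this sketch the strong stub takes the relay seat's first two turns and the weak stub takes over afterwards. In a real setup each stub would be replaced by a call to a model, and the handoff point would be a parameter to tune rather than a fixed value.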
Community
We’ve identified a “Collaboration Gap” in today’s top AI models.
Testing 32 leading LMs on our novel maze-solving benchmark, we found that models that excel solo can see their performance collapse when required to collaborate – even with an identical copy of themselves!
Why does this matter? The future of AI is unlikely to be one giant model; it's systems of multiple, independent AI agents with different information and skills. Current attempts at multi-agent systems rely on pre-defined communication protocols or central orchestration. In contrast, open-world integration likely requires flexible, on-the-fly communication to adapt to the diversity of the real world.
We provide insights into homogeneous and heterogeneous collaboration and explore a "relay" inference approach for effective heterogeneous deployment. Our findings argue that collaboration is a distinct capability that current training strategies fail to capture. We shouldn’t just hope for it to emerge – we must design for it. This means new evals, training strategies, and interaction designs.
Evaluating this sort of behavior was one of the first things I did with GPT-3.5. For my particular use case at the time, it was actually easier to use the DaVinci model to get what I needed.