We introduce CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi-sender pragmatic reasoning task and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling, and pragmatic communication errors, together with a taxonomy of behavioral failure profiles observed in both frontier and open-weight models. Across 15 models (8 open-weight and 7 frontier, including reasoning models), we find that stronger reasoning ability does not reliably translate into better coordination: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi-agent coordination remains a fundamentally unsolved challenge for current language models. Our code is available at https://github.com/csu-signal/CRAFT.