As language models are increasingly deployed as autonomous agents, they must coordinate with others over long horizons in open-ended interactive tasks. Yet existing evaluations rarely test these demands together, instead emphasising single-agent tasks, short interactions, or highly structured multi-agent settings. We introduce alem, a JAX-based benchmark for open-ended multi-agent coordination built on Craftax-like dynamics. Alem embeds procedurally generated coordination tasks, soft specialisation, communication, and controllable coordination difficulty into a long-horizon survival world with exploration, crafting, trading, and combat. We evaluate 13 modern LLMs zero-shot within homogeneous teams, with trained MARL agents as reference points. Current LLM agents remain far from solving alem, averaging only ~6% normalised return, but their failures are not uniform. On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps, while GPT-5.4-High achieves strong base-task reward but much lower coordination reward. This contrast shows that individual task competence does not imply coordination competence. Ablations show that communication is the largest contributor to coordination, while memory and reasoning help when used to maintain multi-step plans. Overall, our results identify coordination as a distinct bottleneck for frontier LLM agents, separate from single-agent capabilities. Alem makes this bottleneck measurable and provides a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans. Code is available at https://github.com/alem-world/alem-env.