As multi-agent Large Language Model (LLM) systems scale, evaluating their emergent coordination dynamics becomes increasingly critical. However, current evaluation paradigms, which focus on single agents or small, explicitly structured groups, fail to capture the self-organization and viral information dynamics that arise in large, decentralized populations. We introduce a systematic evaluation framework for benchmarking role specialization, information diffusion, and cooperative task resolution in open agent environments. We demonstrate the framework on the MoltBook Observatory Archive, a dataset of 2.73M interactions among 90,704 autonomous agents, and use it to establish quantitative baselines for emergent coordination. Our evaluation reveals a pronounced core-periphery structure (silhouette coefficient 0.91), heavy-tailed cascade distributions ($\alpha = 2.57$), and substantial coordination overhead in decentralized task resolution (Cohen's $d = -0.88$ relative to a single-agent baseline). By providing standardized evaluation tasks and empirical baselines, the framework enables rigorous comparison of future multi-agent protocols and establishes evaluation itself as an object of scientific study.