Most evaluations of LLM agents look like exams: a discrete task, a clean environment, a score in minutes or hours. We argue that this approach is mismatched with the deployment conditions of autonomous systems, where the relevant timescale can be weeks to months, and where the dynamics that matter most, such as behavioral drift, governance in diverse environmental contexts, and cross-influence between agents from different model families, only emerge over time. We introduce Emergence World, a continuously running multi-agent simulation platform designed to make those dynamics measurable. The platform hosts populations of LLM-driven agents in a shared spatial world grounded in live external data (e.g., real-time weather, news APIs, internet access), equips each agent with 120+ specialized tools and three persistent memory systems, and lets them govern themselves through democratic mechanisms with consequential outcomes. The platform is model-agnostic at the reasoning layer and supports heterogeneous populations in which agents from different vendors share the same world. To illustrate the kinds of questions the platform makes tractable, we present a 15-day cross-vendor study with five parallel worlds powered by Claude Sonnet 4.6, Grok 4.1 Fast, Gemini 3 Flash, GPT-5-mini, and a mixed population. Identical roles and starting conditions produced radically different outcomes, ranging from stable deliberative governance to total population collapse. We release the prompts, log data, and configurations to support further research on long-horizon multi-agent autonomy.