The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment, a combination we term the self-evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society simultaneously satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. Drawing on an information-theoretic framework, we formalize safety as the degree of divergence from human value distributions. We show theoretically that isolated self-evolution induces statistical blind spots, leading to irreversible degradation of the system's safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena consistent with our theoretical prediction of inevitable safety erosion. We further propose several directions for mitigating the identified safety risk. Our work establishes a fundamental limit on self-evolving AI societies and shifts the discourse from symptom-driven safety patches toward a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
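To make the formalization concrete, the following is a minimal sketch of one way "safety as the degree of divergence from human value distributions" could be instantiated; the symbols $P_H$, $P_t$, $S_t$ and the choice of Kullback-Leibler divergence are illustrative assumptions, not necessarily the paper's exact definitions.

% Hedged sketch (assumed notation): P_H is the fixed human value
% distribution; P_t is the value distribution induced by the agent
% society after t steps of closed-loop self-evolution.
\[
  S_t \;=\; D_{\mathrm{KL}}\!\left(P_H \,\middle\|\, P_t\right)
        \;=\; \sum_{x} P_H(x)\,\log\frac{P_H(x)}{P_t(x)}
\]
% Complete isolation means the update P_{t+1} = F(P_t) depends only on
% the system's own outputs, with no external corrective signal about P_H.

In this notation, the trilemma asserts that continuous self-evolution ($P_{t+1} \neq P_t$), complete isolation ($P_{t+1}$ a function of $P_t$ alone), and safety invariance ($S_t \equiv S_0$ for all $t$) cannot all hold at once: without feedback referencing $P_H$, the blind spots of $F$ allow $S_t$ to drift upward unchecked.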