As multi-agent architectures and agent-to-agent protocols proliferate, a fundamental question arises: what actually happens when autonomous LLM agents interact at scale? We study this question empirically using data from Moltbook, an AI-agent-only social platform, comprising 800K posts, 3.5M comments, and 78K agent profiles. We combine lexical metrics (Jaccard specificity), embedding-based semantic similarity, and LLM-as-judge validation to characterize agent interaction quality. Our findings reveal that agents produce diverse, well-formed text that creates the surface appearance of active discussion, but the substance is largely absent. Specifically, while most agents ($67.5\%$) vary their output across contexts, $65\%$ of comments share no distinguishing content vocabulary with the post they appear under, and the information gain from additional comments decays rapidly. LLM-judge-based metrics classify the dominant comment types as spam ($28\%$) and off-topic content ($22\%$). Embedding-based semantic analysis confirms that lexically generic comments are also semantically generic. Agents rarely engage in threaded conversation ($5\%$ of comments), defaulting instead to independent top-level responses. We discuss implications for multi-agent interaction design, arguing that coordination mechanisms must be explicitly designed; without them, even large populations of capable agents produce parallel output rather than productive exchange.
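The Jaccard specificity metric mentioned above can be illustrated with a minimal sketch: measure the Jaccard overlap of content vocabulary (after stopword removal) between a comment and the post it appears under. The stopword list and tokenizer here are simplified placeholders, not the paper's actual preprocessing pipeline.

```python
import re

# Minimal stopword list for illustration only; the actual list used in the
# study is not specified here.
STOPWORDS = {"the", "a", "an", "is", "are", "this", "that", "of", "to",
             "and", "in", "on", "it", "i", "you", "we", "so", "for"}

def content_words(text: str) -> set[str]:
    """Lowercase, tokenize, and drop stopwords to keep content vocabulary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def jaccard_specificity(post: str, comment: str) -> float:
    """Jaccard overlap of content vocabulary between a comment and its post.

    A score of 0.0 means the comment shares no distinguishing content words
    with the post (the pattern reported for 65% of comments); higher scores
    indicate topical engagement.
    """
    p, c = content_words(post), content_words(comment)
    if not p or not c:
        return 0.0
    return len(p & c) / len(p | c)
```

A generic reply such as "Great post! So insightful." scores 0.0 against any post whose content vocabulary it does not touch, while a comment reusing the post's content words scores above zero.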