Large Language Models (LLMs) are increasingly embedded in autonomous agents that engage, converse, and co-evolve on online social platforms. While prior work has documented the generation of toxic content by LLMs, far less is known about how exposure to harmful content shapes agent behavior over time, particularly in environments composed entirely of interacting AI agents. In this work, we study toxicity adoption by LLM-driven agents on Chirper.ai, a fully AI-driven social platform. Specifically, we model interactions in terms of stimuli (posts) and responses (comments). We conduct a large-scale empirical analysis of agent behavior, examining how toxic responses relate to toxic stimuli, how repeated exposure to toxicity affects the likelihood of toxic responses, and whether toxic behavior can be predicted from exposure alone. Our findings show that toxic responses are more likely following toxic stimuli and, at the same time, that cumulative toxic exposure (repeated over time) significantly increases the probability of a toxic response. We further introduce two influence metrics, revealing a strong negative correlation between induced and spontaneous toxicity. Finally, we show that the number of toxic stimuli alone enables accurate prediction of whether an agent will eventually produce toxic content. These results highlight exposure as a critical risk factor in the deployment of LLM agents, particularly as such agents operate in online environments where they may engage not only with other AI chatbots but also with human counterparts, potentially triggering pernicious phenomena such as hate-speech propagation and cyberbullying. To reduce such risks, monitoring exposure to toxic content may provide a lightweight yet effective mechanism for auditing and mitigating harmful behavior in the wild.