Large language models (LLMs) are increasingly deployed in human-AI teams as support agents for complex tasks such as information retrieval, programming, and decision-making assistance. While these agents' autonomy and contextual knowledge make them useful, they also expose them to a broad range of attacks, including data poisoning, prompt injection, and even prompt engineering. Through these attack vectors, malicious actors can manipulate an LLM agent into providing harmful information, potentially leading human team members to make harmful decisions. While prior work has focused on LLMs as attack targets or adversarial actors, this paper studies their potential role as defensive supervisors within mixed human-AI teams. Using a dataset of multi-party conversations and decisions from a real human-AI team over a 25-round horizon, we formulate the problem of detecting malicious behavior from interaction traces. We find that LLMs are capable of identifying malicious behavior in real time and without task-specific information, indicating the potential for task-agnostic defense. Moreover, we find that the malicious behavior of interest is not easily identified using simple heuristics, further suggesting that introducing LLM defenders could make human teams more robust to certain classes of attack.