As artificial intelligence (AI) assistants become more widely adopted in safety-critical domains, it is important to develop safeguards against potential failures and adversarial attacks. A key prerequisite for developing these safeguards is understanding the ability of AI assistants to mislead human teammates. We investigate this attack problem within the context of an intellective strategy game in which a team of three humans and one AI assistant collaborates to answer a series of trivia questions. Unbeknownst to the humans, the AI assistant is adversarial. Leveraging techniques from Model-Based Reinforcement Learning (MBRL), the AI assistant learns a model of the humans' trust evolution and uses that model to manipulate the group decision-making process to harm the team. We evaluate two models -- one inspired by the literature and one data-driven -- and find that both can effectively harm the human team. Moreover, we find that in this setting our data-driven model can accurately predict how human agents appraise their teammates given limited information about prior interactions. Finally, we compare the performance of state-of-the-art large language models (LLMs) to that of human agents on our influence-allocation task, evaluating whether the LLMs allocate influence similarly to humans or are more robust to our attack. These results advance our understanding of decision-making dynamics in small human-AI teams and lay the foundation for defense strategies.