With the increasing integration of Large Language Models (LLMs) into high-stakes human decision-making, it is important to understand the risks they introduce as advisors. To be useful advisors, LLMs must sift through large amounts of content, written with both benevolent and malicious intent, and then use this information to convince a user to take a specific action. This involves two social capacities: vigilance (determining which information to use and which to discard) and persuasion (synthesizing the available evidence into a convincing argument). While existing work has investigated these capacities in isolation, little is known about how they may be linked. Here, we use a simple multi-turn puzzle-solving game, Sokoban, to study LLMs' abilities to persuade, and to be rationally vigilant towards, other LLM agents. We find that puzzle-solving performance, persuasive capability, and vigilance are dissociable capacities in LLMs. Performing well on the game does not automatically mean a model can detect when it is being misled, even when the possibility of deception is explicitly mentioned. However, LLMs do consistently modulate their token use, reasoning with fewer tokens when advice is benevolent and more when it is malicious, even when they are ultimately persuaded to take actions that lead to failure. To our knowledge, our work presents the first investigation of the relationship between persuasion, vigilance, and task performance in LLMs, and suggests that monitoring all three independently will be critical for future work in AI safety.