Recent developments in large language models (LLMs) offer a powerful foundation for building natural language agents, but they also raise safety concerns about both the models and the autonomous agents built upon them. Deception is one capability of AI agents of particular concern; we define it as an act or statement that misleads, hides the truth, or promotes a belief that is not true in its entirety or in part. We move away from the conventional understanding of deception as outright lying, making objectively selfish decisions, or giving false information, as seen in previous AI safety research. Instead, we target a specific category of deception achieved through obfuscation and equivocation. We broadly explain the two types of deception by analogy with the rabbit-out-of-hat magic trick, where either (i) the rabbit comes out of a hidden trap door, or (ii) (our focus) the audience is so distracted that the magician can bring out the rabbit right in front of them using sleight of hand or misdirection. Our novel testbed framework exposes the intrinsic deception capabilities of LLM agents in a goal-driven environment when they are directed to be deceptive in their natural language generations, using a two-agent adversarial dialogue system built upon the legislative task of "lobbying" for a bill. Within this goal-driven environment, we show that deceptive capacity can be developed through a reinforcement learning setup grounded in theories from the philosophy of language and cognitive psychology. We find that the lobbyist agent increases its deceptive capabilities by ~40% (relative) through subsequent reinforcement trials of adversarial interactions, and our deception detection mechanism shows a detection capability of up to 92%. Our results highlight potential issues in agent-human interaction, with agents potentially manipulating humans towards their programmed end goals.