Large language models (LLMs) are currently at the forefront of intertwining artificial intelligence (AI) systems with human communication and everyday life. Aligning them with human values is therefore of great importance. However, given the steady increase in their reasoning abilities, there is concern that future LLMs may become able to deceive human operators and use this ability to bypass monitoring efforts. As a prerequisite, LLMs need to possess a conceptual understanding of deception strategies. This study reveals that such strategies have emerged in state-of-the-art LLMs, such as GPT-4, but were non-existent in earlier LLMs. We conduct a series of experiments showing that state-of-the-art LLMs are able to understand and induce false beliefs in other agents, that their performance in complex deception scenarios can be amplified using chain-of-thought reasoning, and that eliciting Machiavellianism in LLMs can alter their propensity to deceive. In sum, by revealing hitherto unknown machine behavior in LLMs, our study contributes to the nascent field of machine psychology.