Large language models (LLMs) are currently at the forefront of intertwining artificial intelligence (AI) systems with human communication and everyday life. Thus, aligning them with human values is of great importance. However, given the steady increase in their reasoning abilities, future LLMs are suspected of becoming able to deceive human operators and to use this ability to bypass monitoring efforts. As a prerequisite, LLMs need to possess a conceptual understanding of deception strategies. This study reveals that such strategies have emerged in state-of-the-art LLMs, such as GPT-4, but were non-existent in earlier LLMs. We conduct a series of experiments showing that state-of-the-art LLMs are able to understand and induce false beliefs in other agents, that their performance in complex deception scenarios can be amplified using chain-of-thought reasoning, and that eliciting Machiavellianism in LLMs can alter their propensity to deceive. In sum, by revealing hitherto unknown machine behavior in LLMs, our study contributes to the nascent field of machine psychology.