Minimizing negative impacts of Artificial Intelligence (AI) systems on human societies without human supervision requires them to be able to align with human values. However, most current work addresses this issue only from a technical point of view, e.g., improving current methods relying on reinforcement learning from human feedback, neglecting what alignment means and what it requires to occur. Here, we propose to distinguish strong and weak value alignment. Strong alignment requires cognitive abilities (either human-like or different from humans) such as understanding and reasoning about agents' intentions and about their ability to causally produce desired effects. We argue that this is required for AI systems like large language models (LLMs) to be able to recognize situations in which human values risk being flouted. To illustrate this distinction, we present a series of prompts showing ChatGPT's, Gemini's and Copilot's failures to recognize some of these situations. Moreover, we analyze word embeddings to show that the nearest neighbors of some human values in LLMs' embedding spaces differ from humans' semantic representations. We then propose a new thought experiment that we call "the Chinese room with a word transition dictionary", extending John Searle's famous proposal. Finally, we mention promising current research directions towards weak alignment, which could produce statistically satisfactory answers in a number of common situations, though so far without guaranteeing any truth value.
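The nearest-neighbor analysis mentioned above can be illustrated with a minimal sketch. The word list and vectors below are invented for illustration and are not the paper's data; in practice the embeddings would be extracted from an LLM or a static embedding model, and the resulting neighbor lists compared with human semantic-association data.

```python
import numpy as np

# Toy embeddings for illustration only (not the paper's actual data).
# In practice, vectors would come from an LLM's embedding layer or a
# static model such as word2vec/GloVe.
embeddings = {
    "fairness": np.array([0.80, 0.10, 0.05]),
    "justice":  np.array([0.75, 0.15, 0.10]),
    "equality": np.array([0.70, 0.20, 0.05]),
    "contract": np.array([0.10, 0.90, 0.30]),
    "penalty":  np.array([0.05, 0.85, 0.40]),
}

def nearest_neighbors(query: str, k: int = 3):
    """Rank all other words by cosine similarity to the query word."""
    q = embeddings[query]
    sims = {
        w: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        for w, v in embeddings.items() if w != query
    }
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(nearest_neighbors("fairness"))
# A mismatch between such neighbor lists and human similarity judgments
# is what the abstract refers to as differing semantic representations.
```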