Value alignment problems arise when the objective specified for an AI agent does not match the true underlying objective of its users. The problem has been widely argued to be one of the central safety problems in AI. Unfortunately, most existing work on value alignment focuses on issues that stem primarily from the fact that reward functions are an unintuitive mechanism for specifying objectives. However, the complexity of the objective specification mechanism is just one of many reasons why a user may misspecify their objective. A foundational cause of misalignment that these works overlook is the inherent asymmetry between human expectations about the agent's behavior and the behavior the agent actually generates for the specified objective. To address this gap, we propose a novel formulation of the value alignment problem, named goal alignment, that focuses on a few central challenges related to value alignment. In doing so, we bridge the currently disparate research areas of value alignment and human-aware planning. Additionally, we propose a first-of-its-kind interactive algorithm that is capable of using information generated under incorrect beliefs about the agent to determine the true underlying goal of the user.