Value alignment problems arise in scenarios where the specified objectives of an AI agent don't match the true underlying objective of its users. The problem has been widely argued to be one of the central safety problems in AI. Unfortunately, most existing work on value alignment focuses on issues that stem primarily from the fact that reward functions are an unintuitive mechanism for specifying objectives. However, the complexity of the objective specification mechanism is just one of many reasons why a user may misspecify their objective. A foundational cause of misalignment that these works overlook is the inherent asymmetry between human expectations about the agent's behavior and the behavior the agent actually generates for the specified objective. To address this lacuna, we propose a novel formulation of the value alignment problem, named goal alignment, that focuses on a few central challenges related to value alignment. In doing so, we bridge the currently disparate research areas of value alignment and human-aware planning. Additionally, we propose a first-of-its-kind interactive algorithm that is capable of using information generated under incorrect beliefs about the agent to determine the true underlying goal of the user.