Goal misgeneralisation is a key challenge in AI alignment, the task of getting powerful artificial intelligences to align their goals with human intentions and human morality. In this paper, we show how the ACE (Algorithm for Concept Extrapolation) agent can solve one of the standard challenges in goal misgeneralisation: the CoinRun challenge. The agent uses no new reward information in the new environment. This points to how autonomous agents could be trusted to act in human interests, even in novel and critical situations.