Goal misgeneralisation is a key challenge in AI alignment, the task of getting powerful artificial intelligences to align their goals with human intentions and human morality. In this paper, we show how the ACE (Algorithm for Concept Extrapolation) agent can solve one of the standard challenges in goal misgeneralisation: the CoinRun challenge. The agent uses no new reward information in the new environment. This points to how autonomous agents could be trusted to act in human interests, even in novel and critical situations.