The Incomplete Preferences Proposal (IPP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the IPP is using a novel 'Discounted Reward for Same-Length Trajectories (DReST)' reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'), and (2) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.