Designing a reward function that captures every aspect of the intended behavior is almost impossible, and it is harder still to design one that generalizes beyond the training environments. Active Inverse Reward Design (AIRD) proposed asking the human a series of queries, each comparing candidate reward functions in a single training environment. The answers give the agent information about suboptimal behaviors, from which it computes a probability distribution over the intended reward function. However, AIRD ignores the possibility of unknown features appearing in real-world environments, as well as the safety measures needed until the agent has fully learned the reward function. I improved this method and created Risk-averse Batch Active Inverse Reward Design (RBAIRD). RBAIRD constructs batches, which are sets of environments the agent encounters when deployed in the real world, and processes them sequentially. For a predetermined number of iterations, it asks queries that the human answers for each environment in the batch. Once a batch has been processed, the improved probabilities are carried over to the next batch. This lets the agent adapt to real-world scenarios and learn how to treat unknown features it encounters for the first time. I also integrated a risk-averse planner, similar to that of Inverse Reward Design (IRD), which samples a set of reward functions from the probability distribution and computes a trajectory that collects the most certain rewards possible. This keeps the agent safe while it is still learning the reward function and makes the approach usable in situations where caution is vital. RBAIRD outperformed the previous approaches in efficiency, accuracy, and action certainty, adapted quickly to new, unknown features, and can be used more widely for the alignment of crucial, powerful AI models.
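The batch-wise belief update and the risk-averse planner described above can be illustrated with a minimal toy sketch. It is not the RBAIRD implementation: the linear reward over feature counts, the Boltzmann-rational answer model with parameter `beta`, the worst-case criterion used as the risk measure, the reuse of one fixed query per environment, and all names (`update_posterior`, `run_batches`, `risk_averse_choice`, the toy data) are assumptions made for illustration. Each query is simplified to a list of feature-count vectors, one per option, standing in for the behavior each candidate reward function would induce.

```python
import math
import random


def dot(w, phi):
    """Linear reward: weight vector times feature counts."""
    return sum(wi * fi for wi, fi in zip(w, phi))


def update_posterior(posterior, candidates, options, answer, beta=5.0):
    """Bayes update of the belief over candidate reward weights.

    Assumed answer model (not taken from the source): the human picks
    option i with probability proportional to exp(beta * w . options[i]).
    """
    updated = []
    for p, w in zip(posterior, candidates):
        scores = [math.exp(beta * dot(w, phi)) for phi in options]
        updated.append(p * scores[answer] / sum(scores))
    total = sum(updated)
    return [p / total for p in updated]


def run_batches(batches, candidates, true_w, iterations=2):
    """Process batches sequentially, carrying the posterior forward.

    Returns the posterior after each batch. The human is simulated by
    picking the option that is best under true_w. For brevity the same
    query is reused in every iteration; a real system would pick new ones.
    """
    posterior = [1.0 / len(candidates)] * len(candidates)
    history = []
    for batch in batches:
        for _ in range(iterations):
            for options in batch:  # one query per environment in the batch
                answer = max(range(len(options)),
                             key=lambda i: dot(true_w, options[i]))
                posterior = update_posterior(posterior, candidates,
                                             options, answer)
        history.append(posterior)
    return history


def risk_averse_choice(trajectories, posterior, candidates,
                       n_samples=50, seed=0):
    """Sample reward functions from the posterior and return the index of
    the trajectory with the best worst-case return over the samples."""
    rng = random.Random(seed)
    sampled = rng.choices(candidates, weights=posterior, k=n_samples)
    return max(range(len(trajectories)),
               key=lambda i: min(dot(w, trajectories[i]) for w in sampled))


# Hypothetical toy data: three features, the third appears only in batch 2.
CANDIDATES = [(1.0, 0.0, 1.0), (1.0, 0.0, -1.0), (0.0, 1.0, 0.0)]
TRUE_W = CANDIDATES[1]
BATCHES = [
    [[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]],  # batch 1: known features only
    [[(1.0, 0.0, 1.0), (1.0, 0.0, 0.0)]],  # batch 2: new feature shows up
]
TRAJECTORIES = [(1.0, 0.0, 0.0), (1.5, 0.0, 2.0)]  # safe, risky
```

In this toy setup the first batch cannot distinguish the two candidates that disagree on the third feature, so the belief stays split between them. The risky trajectory then has the higher expected return (roughly 1.5 against 1.0) but a negative worst case, so the worst-case planner picks the safe one. The second batch contains the new feature and resolves the ambiguity.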