In dynamic programming and reinforcement learning, the policy for the sequential decision making of an agent in a stochastic environment is usually determined by expressing the goal as a scalar reward function and seeking a policy that maximizes the expected total reward. However, many goals that humans care about naturally concern multiple aspects of the world, and it may not be obvious how to condense those into a single reward function. Furthermore, maximization suffers from specification gaming, where the obtained policy achieves a high expected total reward in an unintended way, often taking extreme or nonsensical actions. Here we consider finite acyclic Markov Decision Processes with multiple distinct evaluation metrics, which do not necessarily represent quantities that the user wants to be maximized. We assume the task of the agent is to ensure that the vector of expected totals of the evaluation metrics falls into some given convex set, called the aspiration set. Our algorithm guarantees that this task is fulfilled by using simplices to approximate feasibility sets and propagate aspirations forward while ensuring they remain feasible. It has complexity linear in the number of possible state-action-successor triples and polynomial in the number of evaluation metrics. Moreover, the explicitly non-maximizing nature of the chosen policy and goals yields additional degrees of freedom, which can be used to apply heuristic safety criteria to the choice of actions. We discuss several such safety criteria that aim to steer the agent towards more conservative behavior.
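The task stated above (steering the vector of expected totals into a convex aspiration set rather than maximizing) can be illustrated with a small sketch. This is not the paper's simplex-based algorithm. It only evaluates fixed policies on a hypothetical two-metric acyclic MDP and checks membership in an aspiration set that is assumed here to be an axis-aligned box. The names `MDP`, `expected_totals`, and `in_box`, along with all states, actions, and numbers, are invented for illustration.

```python
# Hypothetical toy acyclic MDP (not from the paper):
# state -> action -> list of (probability, successor, per-step metric vector)
MDP = {
    "s0": {
        "safe":  [(1.0, "end", (1.0, 0.0))],
        "risky": [(0.5, "s1", (0.0, 2.0)), (0.5, "end", (0.0, 0.0))],
    },
    "s1": {
        "stop": [(1.0, "end", (0.0, 0.0))],
        "go":   [(1.0, "end", (2.0, 2.0))],
    },
    "end": {},  # terminal state, no actions
}


def expected_totals(state, policy):
    """Expected total of both metrics from `state` under a stochastic policy.

    `policy` maps state -> {action: probability}. The recursion terminates
    because the MDP is finite and acyclic.
    """
    total = [0.0, 0.0]
    for action, p_action in policy.get(state, {}).items():
        for p_succ, succ, delta in MDP[state][action]:
            future = expected_totals(succ, policy)
            for i in range(2):
                total[i] += p_action * p_succ * (delta[i] + future[i])
    return tuple(total)


def in_box(vector, lower, upper):
    """Membership test for an aspiration set assumed to be a box."""
    return all(lo <= v <= hi for v, lo, hi in zip(vector, lower, upper))


# Assumed aspiration set: metric 0 in [0.9, 1.1], metric 1 in [0.8, 1.2].
LOWER, UPPER = (0.9, 0.8), (1.1, 1.2)

always_safe = {"s0": {"safe": 1.0}}
always_risky = {"s0": {"risky": 1.0}, "s1": {"go": 1.0}}
# Randomizing between the two at s0 yields a convex combination of their totals.
mixed = {"s0": {"safe": 0.5, "risky": 0.5}, "s1": {"go": 1.0}}

for name, pol in [("safe", always_safe), ("risky", always_risky), ("mixed", mixed)]:
    totals = expected_totals("s0", pol)
    print(name, totals, in_box(totals, LOWER, UPPER))
```

In this toy case neither deterministic policy meets the aspiration, while the 50/50 mixture does, with expected totals (1.0, 1.0). This reflects the observation that a non-maximizing goal leaves freedom in how actions are chosen.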