In safe MDP planning, safety is typically specified through a cost function defined on the current state and action. In real-world settings, however, the state representation may lack the fidelity needed to express such safety constraints, and operating on an incomplete model can produce unintended negative side effects (NSEs). To address these challenges, we first associate safety signals with state-action trajectories rather than with a single immediate state-action pair, which makes our safety model highly general. We also assume that categorical safety labels are given for different trajectories, rather than a numerical cost function, which is harder for the problem designer to specify. We then employ a supervised learning model to learn such non-Markovian safety patterns. Second, we develop a Lagrange multiplier method that incorporates the safety model and the underlying MDP model in a single computation graph to facilitate agent learning of safe behaviors. Finally, our empirical results on a variety of discrete and continuous domains show that this approach can satisfy complex non-Markovian safety constraints while optimizing an agent's total returns, is highly scalable, and outperforms the previous best approach for Markovian NSEs.
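The Lagrangian idea in the abstract can be illustrated with a deliberately small sketch. It is not the paper's implementation: the three trajectories, their returns, the rule inside `unsafe_label` (a hand-written stand-in for the learned trajectory classifier), the threshold `delta`, and the step sizes are all hypothetical. The "policy" is simply a softmax distribution over whole trajectories, so the safety label can depend on the full history rather than on a single state-action pair. The loop alternates a gradient ascent step on the policy logits with a projected ascent step on the multiplier.

```python
import math

# Hypothetical toy problem: the agent picks one whole trajectory.
TRAJECTORIES = [
    ("s0", "a", "b", "goal"),
    ("s0", "b", "b", "goal"),
    ("s0", "a", "a", "goal"),
]
RETURNS = [1.0, 3.0, 2.0]  # hypothetical total return of each trajectory


def unsafe_label(traj):
    """Stand-in for a learned classifier: unsafe if 'b' is visited twice.

    The label depends on the whole trajectory, so it is non-Markovian.
    """
    return 1.0 if traj.count("b") >= 2 else 0.0


COSTS = [unsafe_label(t) for t in TRAJECTORIES]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def lagrangian(p, lam, delta):
    """L(p, lam) = E_p[return] - lam * (E_p[unsafe] - delta)."""
    exp_return = sum(pi * r for pi, r in zip(p, RETURNS))
    exp_cost = sum(pi * c for pi, c in zip(p, COSTS))
    return exp_return - lam * (exp_cost - delta)


def train(delta=0.1, steps=2000, lr_theta=0.1, lr_lambda=0.05):
    theta = [0.0] * len(TRAJECTORIES)  # policy logits
    lam = 0.0                          # Lagrange multiplier, kept >= 0
    for _ in range(steps):
        p = softmax(theta)
        # Per-trajectory payoff under the current multiplier.
        payoff = [r - lam * c for r, c in zip(RETURNS, COSTS)]
        baseline = sum(pi * f for pi, f in zip(p, payoff))
        # Exact gradient of E_p[payoff] with respect to softmax logits.
        theta = [t + lr_theta * pi * (f - baseline)
                 for t, pi, f in zip(theta, p, payoff)]
        # Raise lam while the constraint E_p[unsafe] <= delta is violated.
        violation = sum(pi * c for pi, c in zip(p, COSTS)) - delta
        lam = max(0.0, lam + lr_lambda * violation)
    return softmax(theta), lam


if __name__ == "__main__":
    probs, lam = train()
    print("trajectory probabilities:", probs)
    print("multiplier:", lam)
```

Simple alternating updates of this kind can oscillate around the constrained optimum rather than settle on it, so the final iterate should be read as illustrative only. In the setting the abstract describes, the hand-written rule would be replaced by the supervised safety model and the softmax over trajectories by a policy acting in the MDP.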