One of the main challenges in imitation learning is determining what action an agent should take when it is outside the state distribution of the demonstrations. Inverse reinforcement learning (IRL) can enable generalization to new states by learning a parameterized reward function, but these approaches still face uncertainty over the true reward function and the corresponding optimal policy. Existing safe imitation learning approaches based on IRL deal with this uncertainty using a maxmin framework that optimizes a policy under the assumption of an adversarial reward function, whereas risk-neutral IRL approaches optimize a policy for either the mean or the maximum a posteriori (MAP) reward function. Completely ignoring risk can lead to overly aggressive and unsafe policies, but optimizing in a fully adversarial sense is also problematic, as it can lead to overly conservative policies that perform poorly in practice. To provide a bridge between these two extremes, we propose Bayesian Robust Optimization for Imitation Learning (BROIL). BROIL leverages Bayesian reward function inference and a user-specific risk tolerance to efficiently optimize a robust policy that balances expected return and conditional value at risk (CVaR). Our empirical results show that BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors and outperforms existing risk-sensitive and risk-neutral inverse reinforcement learning algorithms. Code is available at https://github.com/dsbrown1331/broil.
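To make the trade-off between expected return and conditional value at risk concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the balance is a convex combination weighted by a hypothetical parameter `lam`, that the posterior over reward functions is represented by a finite set of samples, and that each candidate policy has already been evaluated under every sample. The names `cvar`, `broil_objective`, `lam`, and `alpha`, and the example numbers, are illustrative assumptions. The sketch only scores fixed policies; it does not perform the policy optimization itself.

```python
import numpy as np


def cvar(returns, alpha, probs=None):
    """CVaR at level alpha of a discrete return distribution (higher is better).

    Uses the identity CVaR = max over sigma of
        sigma - E[max(0, sigma - X)] / (1 - alpha),
    whose maximum is attained at one of the sample values. Requires alpha < 1.
    """
    returns = np.asarray(returns, dtype=float)
    if probs is None:
        # Assumption: equally weighted posterior samples.
        probs = np.full(returns.shape, 1.0 / returns.size)
    # shortfall[i, j] = max(0, returns[i] - returns[j]), with sigma = returns[i]
    shortfall = np.maximum(0.0, returns[:, None] - returns[None, :])
    candidates = returns - (shortfall @ probs) / (1.0 - alpha)
    return float(candidates.max())


def broil_objective(returns, lam, alpha, probs=None):
    """Assumed objective: lam * expected return + (1 - lam) * CVaR_alpha."""
    returns = np.asarray(returns, dtype=float)
    if probs is None:
        probs = np.full(returns.shape, 1.0 / returns.size)
    expected = float(returns @ probs)
    return lam * expected + (1.0 - lam) * cvar(returns, alpha, probs)


# Hypothetical returns of two fixed policies under four posterior reward samples.
risky = [10.0, 10.0, 10.0, -20.0]  # higher mean, bad worst case
safe = [1.0, 1.0, 1.0, 1.0]        # lower mean, no downside

for lam in (1.0, 0.5, 0.0):
    scores = {
        "risky": broil_objective(risky, lam, alpha=0.75),
        "safe": broil_objective(safe, lam, alpha=0.75),
    }
    print(lam, max(scores, key=scores.get), scores)
```

Under these assumptions, `lam = 1` recovers a risk-neutral score based only on the mean return, while `lam = 0` scores a policy purely by its tail behavior. Intermediate values interpolate between the two.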