Inverse reinforcement learning (IRL) algorithms often rely on (forward) reinforcement learning or planning over a given time horizon to compute an approximately optimal policy for a hypothesized reward function and then match this policy with expert demonstrations. The time horizon plays a critical role in determining both the accuracy of the reward estimate and the computational efficiency of IRL algorithms. Interestingly, an effective time horizon shorter than the ground-truth horizon often produces better results faster. This work formally analyzes this phenomenon and provides an explanation: the time horizon controls the complexity of the induced policy class, so a shorter horizon mitigates overfitting when data is limited. This analysis leads to a principled choice of the effective horizon for IRL. It also prompts us to reexamine the classic IRL formulation: it is more natural to learn the reward and the effective horizon jointly than to learn the reward alone with a given horizon. Our experimental results confirm the theoretical analysis.
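The claim that the horizon controls the complexity of the induced policy class can be illustrated with a small numerical sketch. The code below is not the paper's method. It is a toy illustration under stated assumptions: a randomly generated tabular MDP, state-only rewards drawn from a Gaussian, and finite-horizon value iteration as the planner. The names `greedy_policy` and `count_induced_policies` are hypothetical helpers introduced here. The sketch counts how many distinct greedy policies a fixed set of candidate rewards can induce at each planning horizon.

```python
import numpy as np


def greedy_policy(P, R, horizon):
    """Finite-horizon value iteration; returns the first-step greedy policy.

    P: transition probabilities, shape (A, S, S), P[a, s, t] = Pr(t | s, a).
    R: rewards, shape (S, A).
    horizon: number of planning steps, must be >= 1.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    V = np.zeros(R.shape[0])
    for _ in range(horizon):
        # Q[s, a] = R[s, a] + sum_t P[a, s, t] * V[t]
        Q = R + np.einsum("ast,t->sa", P, V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)


def count_induced_policies(P, rewards, horizon):
    """Number of distinct greedy policies induced by the candidate rewards."""
    return len({tuple(greedy_policy(P, R, horizon)) for R in rewards})


# Toy setup (all values are illustrative assumptions, not from the paper).
rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))

# State-only rewards: R[s, a] = r[s] for every action a.
rewards = [
    np.repeat(rng.normal(size=(n_states, 1)), n_actions, axis=1)
    for _ in range(200)
]

for T in (1, 2, 5, 20):
    print(T, count_induced_policies(P, rewards, T))
```

With state-only rewards and a horizon of 1, every action has the same value in every state, so all candidate rewards collapse to a single greedy policy. Longer horizons let the transition dynamics separate the rewards, so the set of reachable policies can grow. The growth need not be monotone in general, and this count is only a crude proxy for the complexity measure analyzed in the paper.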