Inverse reinforcement learning (IRL) denotes a powerful family of algorithms for recovering a reward function that justifies the behavior demonstrated by an expert agent. A well-known limitation of IRL is the ambiguity in the choice of the reward function, since multiple rewards can explain the observed behavior. This limitation has recently been circumvented by formulating IRL as the problem of estimating the feasible reward set, i.e., the region of rewards compatible with the expert's behavior. In this paper, we take a step towards closing the theory gap of IRL in the case of finite-horizon problems with a generative model. We start by formally introducing the problem of estimating the feasible reward set and the corresponding PAC requirement, and by discussing the properties of particular classes of rewards. Then, we provide the first minimax lower bound on the sample complexity of estimating the feasible reward set, of order $\Omega\Bigl( \frac{H^3SA}{\epsilon^2} \bigl( \log \bigl(\frac{1}{\delta}\bigr) + S \bigr)\Bigr)$, where $S$ and $A$ are the numbers of states and actions, respectively, $H$ is the horizon, $\epsilon$ is the desired accuracy, and $\delta$ is the confidence. We analyze the sample complexity of a uniform sampling strategy (US-IRL), proving a matching upper bound up to logarithmic factors. Finally, we outline several open questions in IRL and propose future research directions.
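To make the scaling of the bound concrete, the following minimal Python sketch evaluates the expression $\frac{H^3SA}{\epsilon^2}\bigl(\log\frac{1}{\delta} + S\bigr)$ with all constants dropped, and illustrates what a uniform allocation of generative-model queries could look like. It is an illustration under stated assumptions, not the US-IRL algorithm itself: the function names, the `sample_next_state` callback, and the choice to estimate only transition frequencies are hypothetical, and the actual procedure may differ, for instance in how the expert's policy is estimated.

```python
import math


def lower_bound_order(S, A, H, eps, delta):
    """Evaluate H^3 * S * A / eps^2 * (log(1/delta) + S).

    Constants hidden by the Omega notation are ignored, so the value
    only indicates how the bound scales, not an actual sample count.
    """
    return (H ** 3) * S * A / (eps ** 2) * (math.log(1.0 / delta) + S)


def uniform_sampling_estimate(sample_next_state, S, A, H, n_per_triple):
    """Query a generative model the same number of times everywhere.

    `sample_next_state(h, s, a)` is a hypothetical callback returning a
    next-state index in range(S). The result p_hat[h][s][a][s_next]
    holds the empirical transition frequencies.
    """
    p_hat = [[[[0.0] * S for _ in range(A)] for _ in range(S)]
             for _ in range(H)]
    for h in range(H):
        for s in range(S):
            for a in range(A):
                for _ in range(n_per_triple):
                    s_next = sample_next_state(h, s, a)
                    p_hat[h][s][a][s_next] += 1.0 / n_per_triple
    return p_hat
```

For example, doubling the horizon $H$ while keeping the other parameters fixed multiplies `lower_bound_order` by $2^3 = 8$, which reflects the cubic dependence on $H$ in the bound. The uniform allocation uses $H \cdot S \cdot A \cdot$ `n_per_triple` queries in total.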