We consider the problem of interactive decision making, encompassing structured bandits and reinforcement learning with general function approximation. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient, a measure of statistical complexity that lower bounds the optimal regret for interactive decision making, as well as a meta-algorithm, Estimation-to-Decisions, which achieves upper bounds in terms of the same quantity. Estimation-to-Decisions is a reduction, which lifts algorithms for (supervised) online estimation into algorithms for decision making. In this paper, we show that by combining Estimation-to-Decisions with a specialized form of optimistic estimation introduced by Zhang (2022), it is possible to obtain guarantees that improve upon those of Foster et al. (2021) by accommodating more lenient notions of estimation error. We use this approach to derive regret bounds for model-free reinforcement learning with value function approximation, and give structural results showing when it can and cannot help more generally.