A long line of work characterizes the sample complexity of regret minimization in sequential decision-making by min-max programs. In the corresponding saddle-point game, the min-player optimizes the sampling distribution against an adversarial max-player that chooses confusing models leading to large regret. The most recent instantiation of this idea is the decision-estimation coefficient (DEC), which was shown to provide nearly tight lower and upper bounds on the worst-case expected regret in structured bandits and reinforcement learning. By re-parametrizing the offset DEC with the confidence radius and solving the corresponding min-max program, we derive an anytime variant of the Estimation-To-Decisions (E2D) algorithm. Importantly, the algorithm optimizes the exploration-exploitation trade-off online instead of via the analysis. Our formulation leads to a practical algorithm for finite model classes and linear feedback models. We further point out connections to the information ratio, decoupling coefficient, and PAC-DEC, and numerically evaluate the performance of E2D on simple examples.
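To make the saddle-point computation concrete, the following is a minimal sketch of the per-round min-max solve for a finite model class, under simplifying assumptions not taken from the abstract: a finite action set, unit-variance Gaussian rewards, and squared mean gaps standing in for the divergence term (proportional to squared Hellinger distance in this Gaussian case). The function name `e2d_distribution` and all variable names are illustrative. Because the inner objective is linear in the sampling distribution, the min-max reduces to a small linear program.

```python
import numpy as np
from scipy.optimize import linprog

def e2d_distribution(means, ref_means, gamma):
    """Illustrative offset-DEC min-max solve for a finite Gaussian-bandit class.

    means:     (M, A) array, mean reward of action a under model m.
    ref_means: (A,) array, mean rewards under the current estimate.
    gamma:     offset parameter trading regret against estimation error.
    Returns an (A,) sampling distribution over actions.
    """
    M, A = means.shape
    # Loss of action a against model m: instantaneous regret under m,
    # offset by gamma times the squared mean gap to the reference model.
    regret = means.max(axis=1, keepdims=True) - means       # (M, A)
    info = (means - ref_means[None, :]) ** 2                # (M, A)
    L = regret - gamma * info                               # (M, A)
    # min_p max_m p . L[m]  as an LP over variables (p, t): minimize t
    # subject to L[m] @ p <= t for every model m, p in the simplex.
    c = np.concatenate([np.zeros(A), [1.0]])
    A_ub = np.hstack([L, -np.ones((M, 1))])
    b_ub = np.zeros(M)
    A_eq = np.concatenate([np.ones(A), [0.0]])[None, :]
    b_eq = np.array([1.0])
    bounds = [(0, None)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:A]
```

On a symmetric two-action, two-model instance the solver returns the uniform distribution, as the max-player's two confusing models penalize either deterministic choice equally.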