Parametric, feature-based reward models are employed by a variety of algorithms in decision-making settings such as bandits and Markov decision processes (MDPs). These algorithms are typically analysed under the assumption of realizability, i.e., that the true values of actions are perfectly explained by some parametric model in the class. We are, however, interested in the situation where the true values are (significantly) misspecified with respect to the model class. For parameterized bandits, contextual bandits and MDPs, we identify structural conditions, depending on the problem instance and model class, under which basic algorithms such as $\epsilon$-greedy, LinUCB and fitted Q-learning provably learn optimal policies even under highly misspecified models. This contrasts with existing worst-case results for, say, misspecified bandits, which show regret bounds that scale linearly with time, and it shows that there can be a nontrivially large set of bandit instances that are robust to misspecification.
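As a rough illustration of what robustness to misspecification can look like, the following sketch runs $\epsilon$-greedy with a one-parameter linear reward model on a three-armed bandit whose true means are far from linear in the feature. The instance (`TRUE_MEANS`, `FEATURES`), the function name `run_epsilon_greedy`, the noise level and the least-squares update are all assumptions made for this example; they are not taken from the paper and do not reproduce its structural conditions. The point is only that the fitted model can have a large prediction error and still rank the optimal arm first.

```python
import random


def run_epsilon_greedy(true_means, features, epsilon=0.1, rounds=2000,
                       noise_std=0.1, seed=0):
    """Epsilon-greedy with a scalar linear model r(a) ~ theta * features[a].

    Hypothetical toy setup: theta is the running least-squares estimate
    through the origin, refit after every pull.
    """
    rng = random.Random(seed)
    arms = range(len(true_means))
    sum_xr = 0.0  # running sum of feature * observed reward
    sum_xx = 0.0  # running sum of feature squared
    theta = 0.0
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_means))  # explore uniformly
        else:
            arm = max(arms, key=lambda a: theta * features[a])  # exploit
        reward = true_means[arm] + rng.gauss(0.0, noise_std)
        sum_xr += features[arm] * reward
        sum_xx += features[arm] ** 2
        theta = sum_xr / sum_xx  # least-squares fit through the origin
    greedy_arm = max(arms, key=lambda a: theta * features[a])
    return theta, greedy_arm


# Assumed instance: no single theta fits all three means to within 0.2.
TRUE_MEANS = [0.1, 0.2, 0.9]
FEATURES = [1.0, 2.0, 3.0]

theta_hat, greedy_arm = run_epsilon_greedy(TRUE_MEANS, FEATURES)
optimal_arm = max(range(len(TRUE_MEANS)), key=lambda a: TRUE_MEANS[a])
max_model_error = max(abs(m - theta_hat * x)
                      for m, x in zip(TRUE_MEANS, FEATURES))

print("fitted theta:", round(theta_hat, 3))
print("largest prediction error:", round(max_model_error, 3))
print("greedy arm:", greedy_arm, "optimal arm:", optimal_arm)
```

In this toy instance the greedy arm coincides with the optimal arm whenever the fitted parameter is positive, because the optimal arm also has the largest feature. Reordering the features so that this no longer holds would break the agreement, which is the kind of instance-dependent behaviour the abstract refers to.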