We tackle average-reward infinite-horizon POMDPs with an unknown transition model but a known observation model, a setting that has previously been addressed in two limiting ways: (i) frequentist methods relying on suboptimal stochastic policies that assign a minimum probability to every action, and (ii) Bayesian approaches employing the optimal policy class but requiring strong assumptions on the consistency of the employed estimators. Our work removes these limitations by establishing estimation guarantees for the transition model and by introducing an optimistic algorithm that leverages the optimal class of deterministic belief-based policies. We modify existing estimation techniques so as to provide theoretical guarantees separately for each estimated action transition matrix. Unlike existing estimation methods, which cannot combine samples collected under different policies, we present a novel and simple estimator that overcomes this barrier. This new data-efficient technique, combined with the proposed \emph{Action-wise OAS-UCRL} algorithm and a tighter theoretical analysis, leads to the first approach enjoying a regret guarantee of order $\mathcal{O}(\sqrt{T \,\log T})$ with respect to the optimal policy, thus improving over state-of-the-art techniques. Finally, the theoretical results are validated through numerical simulations that demonstrate the efficacy of our method against baseline methods.