Dealing with Partially Observable Markov Decision Processes (POMDPs) is a notoriously challenging task. We consider an average-reward infinite-horizon POMDP setting with an unknown transition model, where the observation model is assumed to be known. Under this assumption, we propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy. We then propose the OAS-UCRL algorithm, which implicitly balances the exploration-exploitation trade-off following the $\textit{optimism in the face of uncertainty}$ principle. The algorithm runs through episodes of increasing length. In each episode, the optimal belief-based policy of the estimated POMDP interacts with the environment and collects samples, which the OAS estimation procedure uses in the next episode to compute a new estimate of the POMDP parameters. Given the estimated model, an optimization oracle computes the new optimal policy. We show the consistency of the OAS procedure and prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm, measured against the oracle playing the optimal stochastic belief-based policy. We also show that our approach scales efficiently with the dimensionality of the state, action, and observation spaces. Finally, we conduct numerical simulations to validate the proposed technique and compare it with other baseline approaches.
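To make the episodic structure concrete, below is a minimal Python sketch of the OAS-UCRL loop as described above. All helper names (`oas_estimate`, `optimization_oracle`, `run_policy`), the toy dimensions, and the stub bodies are hypothetical placeholders standing in for the paper's actual estimator, planning oracle, and environment; only the loop structure reflects the algorithm's description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration only (hypothetical).
N_STATES, N_ACTIONS, N_OBS = 3, 2, 4

# The observation model O[s, o] = P(o | s) is assumed known.
OBS_MODEL = rng.dirichlet(np.ones(N_OBS), size=N_STATES)


def oas_estimate(samples, obs_model):
    """Hypothetical stand-in for the OAS estimator: returns an estimate
    of the transition model P(s' | s, a) from the collected samples.
    Placeholder body: a Dirichlet draw, NOT the actual spectral method."""
    return rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))


def optimization_oracle(transitions, obs_model):
    """Hypothetical oracle returning the optimal belief-based policy of
    the estimated POMDP. Placeholder: a random action-selection rule."""
    return lambda belief: rng.integers(N_ACTIONS)


def run_policy(policy, horizon):
    """Hypothetical environment interaction: plays the policy for
    `horizon` steps and returns the collected (action, observation)
    samples. The belief update is omitted in this stub."""
    belief = np.full(N_STATES, 1.0 / N_STATES)
    samples = []
    for _ in range(horizon):
        a = policy(belief)
        o = rng.integers(N_OBS)  # stub observation from the environment
        samples.append((a, o))
    return samples


# OAS-UCRL main loop: episodes of (here, geometrically) increasing length.
policy = optimization_oracle(None, OBS_MODEL)
for episode in range(5):
    horizon = 2 ** episode                               # growing episode length
    samples = run_policy(policy, horizon)                # interact and collect
    est_transitions = oas_estimate(samples, OBS_MODEL)   # re-estimate the model
    policy = optimization_oracle(est_transitions, OBS_MODEL)  # new optimal policy
```

The key design point mirrored here is that each episode's samples feed the next episode's estimate, so the policy is refreshed only at episode boundaries rather than at every step.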