We study reinforcement learning with delayed state observation, where the agent observes the current state only after a random number of time steps. We propose an algorithm that combines the augmentation method with the upper confidence bound (UCB) approach. For tabular Markov decision processes (MDPs), we derive a regret bound of $\tilde{\mathcal{O}}(H \sqrt{D_{\max} SAK})$, where $S$ and $A$ are the cardinalities of the state and action spaces, $H$ is the time horizon, $K$ is the number of episodes, and $D_{\max}$ is the maximum delay length. We also provide a lower bound that matches up to logarithmic factors, showing that our approach is optimal up to those factors. Our analytical framework casts this problem as a special case of a broader class of MDPs whose transition dynamics decompose into a known component and an unknown but structured component. We establish general results for this abstract setting, which may be of independent interest.
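To make the augmentation idea concrete, the following is a minimal sketch and not the algorithm proposed here. It assumes a constant, known delay `d` and an initial state that is observed immediately, whereas the setting above allows random delays. The class name `DelayedObservationTracker` and its methods are hypothetical. The augmented state pairs the most recently revealed state with the actions taken since that state.

```python
from collections import deque
from typing import Deque, Hashable, Tuple

State = Hashable
Action = Hashable


class DelayedObservationTracker:
    """Hypothetical helper: builds the augmented state under a fixed delay.

    Assumption (not from the abstract): the delay is a known constant and the
    initial state is revealed immediately.
    """

    def __init__(self, initial_state: State, delay: int) -> None:
        if delay < 0:
            raise ValueError("delay must be non-negative")
        self.delay = delay
        self.last_observed: State = initial_state
        # Actions taken after `last_observed` was the true state.
        self.pending_actions: Deque[Action] = deque()
        # True states the environment has reached but not yet revealed.
        self._hidden_states: Deque[State] = deque()

    def step(self, action: Action, true_next_state: State) -> Tuple[State, Tuple[Action, ...]]:
        """Record one environment step and return the new augmented state."""
        self.pending_actions.append(action)
        self._hidden_states.append(true_next_state)
        # Once more than `delay` states are hidden, the oldest one is revealed.
        if len(self._hidden_states) > self.delay:
            self.last_observed = self._hidden_states.popleft()
            self.pending_actions.popleft()
        return self.augmented_state()

    def augmented_state(self) -> Tuple[State, Tuple[Action, ...]]:
        """Return (last revealed state, actions taken since that state)."""
        return (self.last_observed, tuple(self.pending_actions))


if __name__ == "__main__":
    tracker = DelayedObservationTracker("s0", delay=2)
    for a, s in [("a0", "s1"), ("a1", "s2"), ("a2", "s3")]:
        print(tracker.step(a, s))
```

Under this constant-delay assumption, the number of augmented states is at most $S \cdot A^{d}$, so a tabular method can in principle be run on the augmented process. Random delays require a richer augmented state than this sketch tracks.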