Recent work in reinforcement learning (RL) has made significant progress by using function approximation to ease the sample-complexity bottleneck. Despite this success, existing provably efficient algorithms typically assume that feedback is available immediately after an action is taken. Failing to account for delayed observations can significantly degrade the performance of real-world systems because the regret blows up. In this work, we tackle delayed feedback in RL with linear function approximation by employing posterior sampling, which has been shown empirically to outperform the popular UCB algorithms in a wide range of regimes. We first introduce Delayed-PSVI, an optimistic value-based algorithm that explores the value function space via noise perturbation with posterior sampling. We provide the first analysis of posterior sampling algorithms with delayed feedback in RL and show that our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3 T} + d^2H^2 E[\tau])$ worst-case regret in the presence of unknown stochastic delays, where $E[\tau]$ is the expected delay. To improve computational efficiency and extend applicability to high-dimensional RL problems, we incorporate a gradient-based approximate sampling scheme via Langevin dynamics, yielding Delayed-LPSVI, which maintains the same order-optimal regret guarantee with $\widetilde{O}(dHK)$ computational cost. Empirical evaluations demonstrate the statistical and computational efficacy of our algorithms.
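To make the idea of gradient-based approximate posterior sampling under delayed feedback concrete, here is a minimal sketch. It is not the authors' Delayed-LPSVI: it simplifies to a linear bandit (horizon $H = 1$), assumes a Gaussian ridge-regression posterior, Poisson-distributed delays, and a fixed number of unadjusted Langevin steps with a warm start. The function names `langevin_sample` and `run_delayed_thompson`, and every parameter value, are hypothetical choices made for illustration.

```python
import numpy as np


def langevin_sample(A, b, n_steps, step_size, rng, w0=None):
    """Approximately sample from N(A^{-1} b, A^{-1}) with unadjusted Langevin dynamics.

    The potential is U(w) = 0.5 * w^T A w - b^T w, so grad U(w) = A w - b.
    No matrix inverse or Cholesky factor is needed, only gradient steps.
    """
    d = b.shape[0]
    w = np.zeros(d) if w0 is None else w0.copy()
    for _ in range(n_steps):
        grad = A @ w - b
        noise = rng.standard_normal(d)
        w = w - step_size * grad + np.sqrt(2.0 * step_size) * noise
    return w


def run_delayed_thompson(n_rounds=200, d=3, n_actions=5, mean_delay=5.0, seed=0):
    """Posterior-sampling loop in which each reward arrives after a random delay.

    Assumption: linear bandit (H = 1) with Poisson delays; this is a toy
    stand-in for the delayed-feedback setting, not the paper's algorithm.
    """
    rng = np.random.default_rng(seed)
    theta_true = rng.standard_normal(d)
    A = np.eye(d)       # ridge regulariser plus outer products of observed features
    b = np.zeros(d)     # sum of reward-weighted observed features
    pending = []        # (arrival_round, feature, reward) not yet revealed
    n_observed = 0
    w = np.zeros(d)
    for k in range(n_rounds):
        # Only feedback whose delay has elapsed enters the posterior statistics.
        arrived = [p for p in pending if p[0] <= k]
        pending = [p for p in pending if p[0] > k]
        for _, x, r in arrived:
            A += np.outer(x, x)
            b += r * x
            n_observed += 1
        # Step size below 2 / lambda_max keeps the Langevin iteration stable.
        step = 0.5 / np.linalg.eigvalsh(A)[-1]
        w = langevin_sample(A, b, n_steps=50, step_size=step, rng=rng, w0=w)
        # Act greedily with respect to the sampled parameter.
        candidates = rng.standard_normal((n_actions, d))
        x = candidates[np.argmax(candidates @ w)]
        r = x @ theta_true + 0.1 * rng.standard_normal()
        delay = rng.poisson(mean_delay)
        pending.append((k + 1 + delay, x, r))
    return theta_true, A, b, n_observed, len(pending)


if __name__ == "__main__":
    theta, A, b, n_obs, n_pend = run_delayed_thompson()
    print("observed:", n_obs, "still pending:", n_pend)
    print("ridge estimate:", np.linalg.solve(A, b))
    print("true parameter:", theta)
```

Each Langevin step costs one matrix-vector product, which is the motivation for replacing exact posterior sampling in high dimensions. The fixed step count and the warm start from the previous sample are simplifications, and the sampled parameter is only approximately distributed according to the posterior.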