Recent work in reinforcement learning (RL) has made significant progress by using function approximation to reduce sample complexity. Despite this success, existing provably efficient algorithms typically assume that feedback is available immediately after each action. Failing to account for delayed observations can significantly degrade the performance of real-world systems because regret blows up. In this work, we address delayed feedback in RL with linear function approximation using posterior sampling, which has been shown empirically to outperform popular UCB algorithms in a wide range of regimes. We first introduce Delayed-PSVI, an optimistic value-based algorithm that explores the value function space effectively via noise perturbation with posterior sampling. We provide the first analysis of posterior sampling algorithms with delayed feedback in RL and show that our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3 T} + d^2H^2 E[\tau])$ worst-case regret in the presence of unknown stochastic delays, where $E[\tau]$ is the expected delay. To improve computational efficiency and extend applicability to high-dimensional RL problems, we further introduce Delayed-LPSVI, which incorporates a gradient-based approximate sampling scheme via Langevin dynamics and maintains the same order-optimal regret guarantee at $\widetilde{O}(dHK)$ computational cost. Empirical evaluations demonstrate the statistical and computational efficacy of our algorithms.
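To make the two ingredients named above concrete (feedback that arrives after a delay, and gradient-based approximate posterior sampling via Langevin dynamics), the following is a minimal sketch and not the paper's Delayed-PSVI or Delayed-LPSVI. It assumes a plain Gaussian linear-regression posterior, uses unadjusted Langevin dynamics, and picks a conservative step size by a simple heuristic. All function names, parameters, and the step-size rule are hypothetical choices made for illustration.

```python
import numpy as np


def observed_mask(generated_at, delays, current_episode):
    """Assumed delay model: feedback generated in episode k with delay tau
    becomes visible once k + tau <= current_episode."""
    return generated_at + delays <= current_episode


def langevin_sample(features, targets, reg=1.0, n_steps=200, rng=None):
    """Approximately sample w from N(mu, Lambda^{-1}) with
    Lambda = reg * I + Phi^T Phi and mu = Lambda^{-1} Phi^T y,
    using unadjusted Langevin dynamics on the potential
    U(w) = 0.5 * ||Phi w - y||^2 + 0.5 * reg * ||w||^2.
    Only gradients are used; no d x d matrix is formed or inverted."""
    rng = np.random.default_rng() if rng is None else rng
    d = features.shape[1]
    # Heuristic step size: reg + ||Phi||_F^2 upper-bounds the largest
    # eigenvalue of Lambda, which keeps the iteration stable.
    step = 0.5 / (reg + np.sum(features ** 2))
    w = np.zeros(d)
    for _ in range(n_steps):
        grad = features.T @ (features @ w - targets) + reg * w
        noise = rng.standard_normal(d)
        w = w - step * grad + np.sqrt(2.0 * step) * noise
    return w


def delayed_posterior_sample(features, targets, generated_at, delays,
                             current_episode, reg=1.0, n_steps=200, rng=None):
    """Sample a parameter vector using only the feedback that has arrived."""
    mask = observed_mask(generated_at, delays, current_episode)
    return langevin_sample(features[mask], targets[mask],
                           reg=reg, n_steps=n_steps, rng=rng)
```

When no feedback has arrived yet, the filtered data set is empty and the sampler draws approximately from the prior $N(0, I/\text{reg})$, so exploration is driven entirely by the injected noise until delayed observations come in.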