We derive a novel PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for the Markov dependencies in the data through the chain's mixing time. This addresses a central obstacle to obtaining generalization guarantees in reinforcement learning: the sequential nature of the data violates the independence assumptions underlying classical bounds. Our bound yields non-vacuous certificates for modern off-policy algorithms such as Soft Actor-Critic. We demonstrate its practical utility through PB-SAC, a novel algorithm that optimizes the bound during training to guide exploration. Experiments across continuous control tasks show that our approach provides meaningful confidence certificates while maintaining competitive performance.
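For orientation, the following is an illustrative sketch of the kind of bound involved, not the exact statement derived in this paper: a PAC-Bayesian bound adapted to Markov data typically replaces the i.i.d. sample size $n$ with an effective sample size discounted by the mixing time $\tau_{\mathrm{mix}}$, so that for a prior $P$ and posterior $Q$ over policies, with probability at least $1-\delta$,
\[
\mathbb{E}_{\pi \sim Q}\!\left[L(\pi)\right] \;\le\; \mathbb{E}_{\pi \sim Q}\!\left[\hat{L}_{n}(\pi)\right] \;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n_{\mathrm{eff}}}}{\delta}}{2\, n_{\mathrm{eff}}}},
\qquad n_{\mathrm{eff}} \approx \frac{n}{\tau_{\mathrm{mix}}},
\]
where $L$ and $\hat{L}_{n}$ denote the expected and empirical losses, respectively. The symbols $P$, $Q$, $n_{\mathrm{eff}}$, and $\tau_{\mathrm{mix}}$ here are generic notation for this schematic and are not taken from the paper's own statement.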