This work advances randomized exploration in reinforcement learning (RL) with function approximation, modeled by linear mixture MDPs. We establish the first prior-dependent Bayesian regret bound for RL with function approximation, and we refine the Bayesian regret analysis of posterior sampling reinforcement learning (PSRL), obtaining an upper bound of ${\mathcal{O}}(d\sqrt{H^3 T \log T})$, where $d$ is the dimension of the transition kernel, $H$ the planning horizon, and $T$ the total number of interactions. This improves the previous benchmark (Osband and Van Roy, 2014), specialized to linear mixture MDPs, by a factor of $\mathcal{O}(\sqrt{\log T})$. Our approach, built on a value-targeted model learning perspective, introduces a decoupling argument and a variance reduction technique, moving beyond traditional analyses that rely on confidence sets and concentration inequalities to derive Bayesian regret bounds.
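For concreteness, the following is a minimal sketch of the PSRL loop in a linear mixture MDP under a value-targeted regression view: sample a mixture parameter from the current posterior, plan with the sampled model, then update the posterior from value-targeted features. The Gaussian prior, the noise variance, and the helpers `feature_phi`, `plan`, and `run_episode` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of posterior sampling RL (PSRL) for a linear mixture MDP, assuming a
# Gaussian prior over the d-dimensional mixture parameter and Bayesian linear
# regression on value-targeted features. All helper functions are placeholders.

d, H, K = 8, 10, 100          # feature dimension, horizon, number of episodes
lam, sigma2 = 1.0, 1.0        # prior precision and noise variance (assumed)

Sigma_inv = lam * np.eye(d)   # posterior precision matrix
b = np.zeros(d)               # accumulated (feature * value-target) / sigma2

def feature_phi(s, a, v):
    """Placeholder value-targeted feature phi_V(s, a); returns a d-vector."""
    rng = np.random.default_rng(hash((s, a)) % (2**32))
    return rng.standard_normal(d) * v / np.sqrt(d)

def plan(theta):
    """Placeholder planner: returns a policy for the sampled model theta."""
    return lambda s, h: 0

def run_episode(policy):
    """Placeholder interaction: returns (state, action, value-target) triples."""
    return [(h, policy(h, h), float(h)) for h in range(H)]

for k in range(K):
    # 1) Sample a model parameter from the current Gaussian posterior.
    Sigma = np.linalg.inv(Sigma_inv)
    theta_k = np.random.multivariate_normal(Sigma @ b, Sigma)
    # 2) Plan with respect to the sampled model.
    policy_k = plan(theta_k)
    # 3) Execute the policy and update the posterior with value-targeted
    #    regression statistics (phi_V(s, a), observed value target y).
    for s, a, y in run_episode(policy_k):
        phi = feature_phi(s, a, y)
        Sigma_inv += np.outer(phi, phi) / sigma2
        b += phi * y / sigma2
```

The sketch only illustrates the sampling-planning-updating structure that the Bayesian regret analysis reasons about; the decoupling argument and variance reduction technique enter in the analysis of this loop, not in its execution.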