Although actor-critic methods have been successful in practice, their theoretical analyses have several limitations. Specifically, existing theoretical work either sidesteps the exploration problem by making strong assumptions or analyzes impractical methods with complicated algorithmic modifications. Moreover, the actor-critic methods analyzed for linear MDPs often employ natural policy gradient (NPG) and construct "implicit" policies without explicit parameterization. Such policies are computationally expensive to sample from, which makes environment interactions inefficient. To address these limitations, we focus on finite-horizon linear MDPs and propose an optimistic actor-critic framework that uses parametric log-linear policies. In particular, we introduce a tractable \textit{logit-matching} regression objective for the actor. For the critic, we use approximate Thompson sampling via Langevin Monte Carlo to obtain optimistic value estimates. We prove that the resulting algorithm achieves $\widetilde{\mathcal{O}}(\epsilon^{-4})$ and $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity in the on-policy and off-policy settings, respectively. Our results match the state-of-the-art sample complexity of prior theoretical work, while our algorithm is more aligned with practice.
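As a rough illustration of two ingredients named above, the following sketch shows (i) sampling from a parametric log-linear policy, $\pi_\theta(a \mid s) \propto \exp(\theta^\top \phi(s, a))$, and (ii) a generic Langevin Monte Carlo update, i.e. a gradient step plus Gaussian noise. It is not the paper's algorithm: the logit-matching objective and the optimistic critic construction are not specified here, so the feature matrix, the toy ridge-regression loss, the step size, and the inverse temperature are all assumptions chosen only to make the example run.

```python
import numpy as np


def log_linear_policy(theta, features):
    """Return action probabilities pi(a|s) proportional to exp(theta . phi(s, a)).

    features: array of shape (num_actions, d), one row phi(s, a) per action.
    """
    logits = features @ theta
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def langevin_step(w, grad_loss, step_size, inv_temperature, rng):
    """One Langevin Monte Carlo update: gradient descent step plus scaled Gaussian noise."""
    noise = rng.standard_normal(w.shape)
    noise_scale = np.sqrt(2.0 * step_size / inv_temperature)
    return w - step_size * grad_loss(w) + noise_scale * noise


rng = np.random.default_rng(0)
d, num_actions = 4, 3

# Hypothetical features phi(s, a) for a single state (assumption, not from the paper).
features = rng.standard_normal((num_actions, d))
theta = np.zeros(d)

# Sampling an action costs one matrix-vector product and a softmax.
probs = log_linear_policy(theta, features)
action = rng.choice(num_actions, p=probs)

# Toy ridge-regression loss standing in for a critic objective (illustrative only).
X = rng.standard_normal((20, d))
y = rng.standard_normal(20)


def grad_loss(w):
    # Gradient of 0.5 * ||X w - y||^2 + 0.5 * ||w||^2.
    return X.T @ (X @ w - y) + w


w = np.zeros(d)
for _ in range(100):
    w = langevin_step(w, grad_loss, step_size=1e-2, inv_temperature=100.0, rng=rng)
```

With an explicit parameter vector `theta`, drawing an action needs only the features of the current state, which is the practical appeal of parametric policies over implicitly defined ones. The noise term in `langevin_step` is what turns plain gradient descent into an approximate posterior sampler.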