We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes and the state space can be arbitrarily large. We assume that the Q-function of any policy is linear in some known features, that is, a linear function approximation exists. The best existing regret upper bound for this setting (Luo et al., 2021) is of order $\tilde{\mathcal O}(K^{2/3})$ (omitting all other dependencies), given access to a simulator. This paper provides two algorithms that improve the regret to $\tilde{\mathcal O}(\sqrt K)$ in the same setting. Our first algorithm makes use of a refined analysis of the Follow-the-Regularized-Leader (FTRL) algorithm with the log-barrier regularizer. This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest. Our second algorithm develops a magnitude-reduced loss estimator, further removing the polynomial dependency on the number of actions in the first algorithm and leading to the optimal regret bound (up to logarithmic terms and dependency on the horizon). Moreover, we also extend the first algorithm to simulator-free linear MDPs, which achieves $\tilde{\mathcal O}(K^{8/9})$ regret and greatly improves over the best existing bound $\tilde{\mathcal O}(K^{14/15})$. This algorithm relies on a better alternative to the Matrix Geometric Resampling procedure by Neu & Olkhovskaya (2020), which could again be of independent interest.
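To make the log-barrier FTRL step mentioned for the first algorithm concrete, the following is a minimal sketch of a generic FTRL update over a finite action set, assuming a single state, a fixed learning rate `eta`, and a vector `cum_loss` of cumulative loss estimates. It is not the algorithm of this paper: the function name, the bisection solver, and the parameters are illustrative assumptions. The update solves $p = \arg\min_{p \in \Delta} \langle p, L \rangle - \frac{1}{\eta} \sum_i \log p_i$, whose solution has the form $p_i = 1 / (\eta (L_i + \mu))$ for a normalizing multiplier $\mu$.

```python
import numpy as np


def log_barrier_ftrl(cum_loss, eta, iters=100):
    """Generic log-barrier FTRL step on the probability simplex (illustrative).

    Solves argmin_p <p, cum_loss> - (1/eta) * sum_i log(p_i) over the simplex.
    The optimum satisfies p_i = 1 / (eta * (cum_loss_i + mu)), where mu is
    found by bisection so that the probabilities sum to one.
    """
    cum_loss = np.asarray(cum_loss, dtype=float)
    n = cum_loss.size
    # Bracket for mu: at `lo` the smallest-loss action alone has mass 1
    # (sum >= 1); at `hi` every action has mass at most 1/n (sum <= 1).
    lo = -cum_loss.min() + 1.0 / eta
    hi = -cum_loss.min() + n / eta
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.sum(1.0 / (eta * (cum_loss + mid))) > 1.0:
            lo = mid
        else:
            hi = mid
    p = 1.0 / (eta * (cum_loss + hi))
    return p / p.sum()  # remove the residual bisection error


# Hypothetical cumulative loss estimates; negative entries are allowed.
p = log_barrier_ftrl([-5.0, 2.0, 10.0], eta=0.5)
```

Because the multiplier is bracketed relative to the smallest entry of `cum_loss`, the step stays well defined when the estimates are very negative, which is the regime the refined analysis is said to handle.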