This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition. While guaranteeing computational efficiency, our algorithm for linear MDPs achieves the best-known regret upper bound of $\widetilde{\mathcal{O}}(d^{3/2}\mathrm{sp}(v^*)\sqrt{T})$ over $T$ time steps, where $\mathrm{sp}(v^*)$ is the span of the optimal bias function $v^*$ and $d$ is the dimension of the feature mapping. For linear mixture MDPs, our algorithm attains a regret bound of $\widetilde{\mathcal{O}}(d\cdot\mathrm{sp}(v^*)\sqrt{T})$. The algorithm applies novel techniques to control the covering number of the value function class and the span of optimistic estimators of the value function, which are of independent interest.
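For concreteness, a minimal sketch of the standard average-reward definitions these bounds refer to, under common conventions (the symbols $J^*$, denoting the optimal long-run average reward, and $r(s_t,a_t)$, the reward collected at step $t$, are our notation for illustration and are not fixed by the abstract):
\[
  \mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \bigl( J^* - r(s_t, a_t) \bigr),
  \qquad
  \mathrm{sp}(v^*) \;=\; \max_{s} v^*(s) \;-\; \min_{s} v^*(s).
\]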