We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback, without prior knowledge of the transitions or access to a simulator. We introduce two algorithms that achieve improved regret compared to existing approaches. The first algorithm, although computationally inefficient, ensures a regret of $\widetilde{\mathcal{O}}\left(\sqrt{K}\right)$, where $K$ is the number of episodes. This is the first result with the optimal $K$ dependence in the considered setting. The second algorithm, which is based on the policy optimization framework, guarantees a regret of $\widetilde{\mathcal{O}}\left(K^{\frac{3}{4}}\right)$ and is computationally efficient. Both results significantly improve over the state of the art: a computationally inefficient algorithm by Kong et al. [2023] with $\widetilde{\mathcal{O}}\left(K^{\frac{4}{5}}+\mathrm{poly}\left(\frac{1}{\lambda_{\min}}\right)\right)$ regret, for some problem-dependent constant $\lambda_{\min}$ that can be arbitrarily close to zero, and a computationally efficient algorithm by Sherman et al. [2023b] with $\widetilde{\mathcal{O}}\left(K^{\frac{6}{7}}\right)$ regret.