We study reinforcement learning with linear function approximation, unknown transitions, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs, whose transition kernel is a linear mixture model. We propose a new algorithm that attains $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of the feature mappings, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the episode length, and $K$ is the number of episodes. Our result strictly improves on the previous best-known $\widetilde{O}(dS^2 \sqrt{K} + \sqrt{HSAK})$ bound of Zhao et al. (2023a), since $H \leq S$ holds by the layered MDP structure. The improvement stems primarily from (i) a new least-squares estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (ii) a new self-normalized concentration inequality tailored to non-independent noise, which was originally proposed in the dynamic assortment literature and is applied here for the first time in reinforcement learning to handle correlations between different states.
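The comparison with the earlier bound can be checked in one line. The sketch below uses only the two regret bounds and the inequality $H \leq S$ stated above; the improvement factor $\sqrt{S/H}$ is derived here for illustration and is not a figure quoted from the paper.

$$
d\sqrt{HS^3K} \;\leq\; d\sqrt{S \cdot S^3 K} \;=\; dS^2\sqrt{K},
\qquad
\frac{dS^2\sqrt{K}}{d\sqrt{HS^3K}} \;=\; \sqrt{\frac{S}{H}} \;\geq\; 1 .
$$

The second term, $\sqrt{HSAK}$, is the same in both bounds, so the gain comes entirely from the first term, and it is strict whenever $H < S$.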