We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\smash{\tilde{O}(dH^2\sqrt{T})}$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $\bar{\sigma}_T \leq 1/2$, measuring the normalised average variance of the optimal downstream value function along the learner's trajectory. We propose an algorithm achieving a regret of $\smash{\tilde{O}(dH^2\bar{\sigma}_T\sqrt{T})}$, which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $\bar{\sigma}_T = O(H^{-1})$, reducing the horizon dependence by a factor $H$. We further establish a matching $\smash{\Omega(dH^2\bar{\sigma}_T\sqrt{T})}$ lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.
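As a quick check of how the two regimes mentioned in the abstract follow from the upper bound, one can substitute the stated values of $\bar{\sigma}_T$ directly. This is only an illustrative sketch: it assumes constants and logarithmic factors are absorbed into $\tilde{O}(\cdot)$ and uses nothing beyond the bounds quoted in the abstract.

$$
\bar{\sigma}_T \leq \tfrac{1}{2} \;\Longrightarrow\; dH^2\bar{\sigma}_T\sqrt{T} \leq \tfrac{1}{2}\,dH^2\sqrt{T} = O\big(dH^2\sqrt{T}\big),
\qquad
\bar{\sigma}_T = O(H^{-1}) \;\Longrightarrow\; dH^2\bar{\sigma}_T\sqrt{T} = O\big(dH\sqrt{T}\big).
$$

The first implication is the worst-case recovery of the existing $\tilde{O}(dH^2\sqrt{T})$ rate; the second is the factor-$H$ improvement claimed for KL-constrained robust MDPs.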