We study reinforcement learning with multinomial logistic (MNL) function approximation, where the underlying transition probability kernel of the Markov decision process (MDP) is parametrized by an unknown transition core with features of state and action. For the finite-horizon episodic setting with inhomogeneous state transitions, we propose provably efficient algorithms with randomized exploration that enjoy frequentist regret guarantees. For our first algorithm, $\texttt{RRL-MNL}$, we adapt optimistic sampling to ensure the optimism of the estimated value function with sufficient frequency. We establish that $\texttt{RRL-MNL}$ achieves a $\tilde{O}(\kappa^{-1} d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T})$ frequentist regret bound with constant-time computational cost per episode. Here, $d$ is the dimension of the transition core, $H$ is the horizon length, $T$ is the total number of steps, and $\kappa$ is a problem-dependent constant. Despite the simplicity and practicality of $\texttt{RRL-MNL}$, its regret bound scales with $\kappa^{-1}$, which can be prohibitively large in the worst case. To improve the dependence on $\kappa^{-1}$, we propose $\texttt{ORRL-MNL}$, which estimates the value function using the local gradient information of the MNL transition model. We show that its frequentist regret bound is $\tilde{O}(d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T} + \kappa^{-1} d^2 H^2)$. To the best of our knowledge, these are the first randomized RL algorithms for the MNL transition model that achieve statistical guarantees with constant-time computational cost per episode. Numerical experiments demonstrate the superior performance of the proposed algorithms.
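The MNL parametrization of the transition kernel can be made concrete as follows. This is a sketch of the standard multinomial logit form commonly used in this line of work; the feature map $\varphi$ and the reachable-state set $\mathcal{S}(s,a)$ are illustrative notation, not necessarily the paper's own symbols:

```latex
% MNL transition model: for each state-action pair (s, a), the next state s'
% is drawn from a multinomial logit distribution over the reachable set
% \mathcal{S}(s,a), parametrized by the unknown transition core
% \theta^* \in \mathbb{R}^d and known features \varphi(s, a, s') \in \mathbb{R}^d:
P(s' \mid s, a)
  \;=\;
  \frac{\exp\!\big(\varphi(s, a, s')^{\top}\theta^*\big)}
       {\sum_{\tilde{s} \in \mathcal{S}(s,a)}
        \exp\!\big(\varphi(s, a, \tilde{s})^{\top}\theta^*\big)}.
```

Under this kind of parametrization, the problem-dependent constant $\kappa$ in the regret bounds typically lower-bounds how non-degenerate these logit probabilities can become, which is why $\kappa^{-1}$ may be large in the worst case.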