We propose a new regret minimization algorithm for episodic sparse linear Markov decision processes (SMDPs), where the state-transition distribution is a linear function of observed features. The only previously known algorithm for SMDPs requires knowledge of the sparsity parameter and oracle access to an unknown policy. We overcome these limitations by combining the doubly robust method, which allows the use of feature vectors of \emph{all} actions, with a novel analysis technique that enables the algorithm to use data from all periods in all episodes. The regret of the proposed algorithm is $\tilde{O}(\sigma^{-1}_{\min} s_{\star} H \sqrt{N})$, where $\sigma_{\min}$ denotes the restricted minimum eigenvalue of the average Gram matrix of feature vectors, $s_\star$ is the sparsity parameter, $H$ is the length of an episode, and $N$ is the number of rounds. We provide a regret lower bound that matches the upper bound up to logarithmic factors on a newly identified subclass of SMDPs. Our numerical experiments support our theoretical results and demonstrate the superior performance of our algorithm.