Recent studies have shown that episodic reinforcement learning (RL) is no harder than bandits when the total reward is bounded by $1$, and have proved regret bounds with only a polylogarithmic dependence on the planning horizon $H$. However, it remains an open question whether such results can be carried over to adversarial RL, where the reward is adversarially chosen at each episode. In this paper, we answer this question affirmatively by proposing the first horizon-free policy search algorithm. To tackle the challenges caused by exploration and adversarially chosen rewards, our algorithm employs (1) a variance-uncertainty-aware weighted least squares estimator for the transition kernel; and (2) an occupancy measure-based technique for the online search of a \emph{stochastic} policy. We show that our algorithm achieves an $\tilde{O}\big((d+\log (|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces. We also provide hardness results and regret lower bounds to justify the near optimality of our algorithm and the unavoidability of $\log|\mathcal{S}|$ and $\log|\mathcal{A}|$ in the regret bound.
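To make the idea of a weighted least squares estimator concrete, the following is a minimal sketch of generic inverse-variance-weighted ridge regression in plain Python. It is not the estimator analyzed in the paper: the abstract does not specify how the weights are constructed, so the choice of weights $1/\sigma_i^2$, the regularization parameter `lam`, and the function names `weighted_ridge` and `solve_linear_system` are all assumptions made for illustration. The uncertainty-aware part of the weighting is omitted entirely.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Build an augmented copy so the caller's inputs are not modified.
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def weighted_ridge(features, targets, sigmas, lam=1.0):
    """Minimize lam * ||theta||^2 + sum_i (x_i . theta - y_i)^2 / sigma_i^2.

    Assumption for illustration: each sample is weighted by the inverse of
    its (estimated) variance sigma_i^2, so noisier samples count less.
    """
    d = len(features[0])
    # Regularized weighted Gram matrix: lam * I + sum_i w_i x_i x_i^T.
    gram = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    # Weighted moment vector: sum_i w_i x_i y_i.
    moment = [0.0] * d
    for x, y, sigma in zip(features, targets, sigmas):
        w = 1.0 / (sigma * sigma)
        for i in range(d):
            moment[i] += w * x[i] * y
            for j in range(d):
                gram[i][j] += w * x[i] * x[j]
    return solve_linear_system(gram, moment)
```

In this sketch, a sample with a small `sigma` pulls the estimate strongly toward its target, while a high-variance sample contributes little. This is the generic mechanism by which variance-aware weighting can tighten an estimate when most observations have low variance.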