Despite significant interest and progress in reinforcement learning (RL) problems with adversarial corruption, existing works are either confined to the linear setting or yield an undesirable $\tilde{O}(\sqrt{T}\zeta)$ regret bound, where $T$ is the number of rounds and $\zeta$ is the total amount of corruption. In this paper, we consider contextual bandits with general function approximation and propose a computationally efficient algorithm that achieves a regret of $\tilde{O}(\sqrt{T}+\zeta)$. The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandits and on a new weighted estimator of uncertainty for the general function class. In contrast to existing analyses, which rely heavily on the linear structure, we develop a novel technique to control the sum of weighted uncertainty, thus establishing the final regret bounds. We then generalize our algorithm to the episodic MDP setting and are the first to achieve an additive dependence on the corruption level $\zeta$ under general function approximation. Notably, our algorithms achieve regret bounds that either nearly match the lower bound or improve on existing methods for all corruption levels, in both the known and unknown $\zeta$ cases.
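To illustrate the gap between the two dependencies, consider a corruption level of $\zeta = \sqrt{T}$. This value is an assumption chosen here purely for illustration, and constants and logarithmic factors are ignored:

$$\sqrt{T}\,\zeta = T, \qquad \sqrt{T} + \zeta = 2\sqrt{T}.$$

Under this assumption the multiplicative bound grows linearly in $T$, whereas the additive bound remains of order $\sqrt{T}$.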