A standard assumption in Reinforcement Learning is that the agent observes every visited state-action pair in the associated Markov Decision Process (MDP), along with the per-step rewards. Strong theoretical results are known in this setting, achieving nearly tight $\Theta(\sqrt{T})$ regret bounds. However, such detailed feedback can be unrealistic, and recent research has investigated more restricted settings such as trajectory feedback, where the agent observes all the visited state-action pairs but only a single \emph{aggregate} reward. In this paper, we consider a far more restrictive ``fully bandit'' feedback model for episodic MDPs, in which the agent does not even observe the visited state-action pairs -- it only learns the aggregate reward. We provide the first efficient bandit learning algorithm for episodic MDPs with $\widetilde{O}(\sqrt{T})$ regret. Our regret bound has an exponential dependence on the horizon length $H$, which we show is necessary. We also obtain improved nearly tight regret bounds for ``ordered'' MDPs; these can be used to model classical stochastic optimization problems such as the $k$-item prophet inequality and sequential posted pricing. Finally, we evaluate the empirical performance of our algorithm in the setting of $k$-item prophet inequalities; despite the highly restricted feedback, its performance is comparable to that of a state-of-the-art learning algorithm (UCB-VI) that receives detailed state-action feedback.