We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable $ε$-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets $ε$-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan achieve minimax-optimal sample complexity, but only for a restrictively small accuracy parameter $ε$. We propose a new algorithm that significantly relaxes this requirement on $ε$; our approach is of independent technical interest. The algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate estimation of the dynamics and the subsequent computation of an $ε$-optimal policy once the reward is revealed. Finally, we establish a tight lower bound for reward-free exploration, closing the gap between the known upper and lower bounds.
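To make the explore-then-plan paradigm concrete, the following is a minimal sketch, not the paper's algorithm: it substitutes a uniform-random exploration policy for the online-learning construction described above, on a hypothetical small tabular MDP. All sizes ($S$, $A$, $H$), the episode budget, and the smoothing constant are illustrative assumptions. The empirical model $\hat P$ is estimated from exploration data alone; the reward is consumed only in the planning phase, mirroring the reward-free protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 2, 5  # illustrative: states, actions, horizon

# Hypothetical ground-truth dynamics P[s, a] = distribution over next states.
P = rng.dirichlet(np.ones(S), size=(S, A))

# --- Exploration phase (reward-free): collect transitions.
# A uniform-random policy stands in for the paper's learned exploration policy.
counts = np.zeros((S, A, S))
for _ in range(20000):  # episodes
    s = 0
    for _ in range(H):
        a = rng.integers(A)
        s2 = rng.choice(S, p=P[s, a])
        counts[s, a, s2] += 1
        s = s2

# Empirical dynamics, with tiny smoothing so unvisited pairs stay well-defined.
P_hat = (counts + 1e-6) / (counts + 1e-6).sum(axis=2, keepdims=True)

# --- Planning phase: the reward is revealed only now.
def plan(model, r):
    """Finite-horizon value iteration; returns a greedy policy pi[h, s]."""
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r + model @ V  # Q[s, a] = r[s, a] + sum_{s'} model[s, a, s'] V[s']
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi, V

r = rng.random((S, A))          # the revealed reward
pi_hat, _ = plan(P_hat, r)      # plan on the estimated model
pi_star, _ = plan(P, r)         # oracle plan on the true model, for comparison

def value(model, pi):
    """Evaluate a deterministic policy under the given dynamics; V at s=0."""
    V = np.zeros(S)
    for h in reversed(range(H)):
        V = r[np.arange(S), pi[h]] + model[np.arange(S), pi[h]] @ V
    return V[0]

# Suboptimality of the policy planned on P_hat, measured on the true MDP.
gap = value(P, pi_star) - value(P, pi_hat)
```

With enough exploration episodes, `gap` is small: the empirical model is accurate on well-visited state-action pairs, so planning on $\hat P$ yields a near-optimal policy for the revealed reward. The paper's contribution lies precisely in designing the exploration policy so that this accuracy holds with minimax-optimal sample complexity, which the random policy above does not guarantee.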