Batch reinforcement learning (RL) aims to leverage pre-collected data to find an optimal policy that maximizes the expected total reward in a dynamic environment. Existing methods require an absolute continuity assumption (i.e., there exist no non-overlapping regions) on the distribution induced by the target policies with respect to the data distribution over the state space, the action space, or both. We propose a new batch RL algorithm that allows for singularity in both state and action spaces (i.e., the existence of non-overlapping regions between the offline data distribution and the distribution induced by the target policies) in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning. Our algorithm is motivated by a new error analysis of off-policy evaluation, in which we use maximum mean discrepancy, together with distributionally robust optimization, to characterize the evaluation error caused by possible singularity and to enable model extrapolation. By leveraging the idea of pessimism and under some technical conditions, we derive the first finite-sample regret guarantee for our proposed algorithm under singularity. Compared with existing algorithms, by requiring only a minimal data-coverage assumption, STEEL improves the applicability and robustness of batch RL. In addition, we propose a two-step adaptive STEEL that is nearly tuning-free. Extensive simulation studies and one (semi-)real experiment on personalized pricing demonstrate the superior performance of our methods in dealing with possible singularity in batch RL.
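For reference, the abstract's discrepancy measure can be recalled via the standard (textbook) definition of maximum mean discrepancy between two distributions $P$ and $Q$ over a reproducing kernel Hilbert space $\mathcal{H}$ with kernel $k$; the specific construction used in STEEL's error analysis is given in the main text:
\[
\mathrm{MMD}^2(P, Q; \mathcal{H})
= \mathbb{E}_{X, X' \sim P}\big[k(X, X')\big]
- 2\,\mathbb{E}_{X \sim P,\, Y \sim Q}\big[k(X, Y)\big]
+ \mathbb{E}_{Y, Y' \sim Q}\big[k(Y, Y')\big],
\]
which is well defined even when $Q$ is not absolutely continuous with respect to $P$, i.e., under the singularity considered here.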