Reinforcement learning (RL) has achieved remarkable success in real-world decision-making across diverse domains, including gaming, robotics, online advertising, public health, and natural language processing. Despite these advances, a substantial gap remains between RL research and its deployment in many practical settings. Two recurring challenges often underlie this gap. First, practical constraints in many settings limit how extensively the agent can interact with the target environment. Second, target environments often undergo substantial changes that require redesign and redeployment of RL systems (e.g., advancements in science and technology that change the landscape of healthcare delivery). Addressing these challenges and bridging the gap between basic research and application requires theory and methodology that directly inform the design, implementation, and continual improvement of RL systems in real-world settings. In this paper, we frame the application of RL in practice as a three-component process: (i) online learning and optimization during deployment, (ii) post- or between-deployment offline analyses, and (iii) repeated cycles of deployment and redeployment to continually improve the RL system. We provide a narrative review of recent advances in statistical RL that address these components, including methods for enhancing the sample efficiency of online learning within a deployment, maximizing data utility for between-deployment inference, and designing sequences of deployments for continual improvement. We also outline use-inspired future research directions in statistical RL, aimed at impactful application of RL in practice.