Reinforcement learning (RL) algorithms typically maximize the expected cumulative return (discounted or undiscounted, over a finite or infinite horizon). However, several important real-world applications, such as drug discovery, do not fit this framework: the RL agent only needs to identify the states (molecules) that achieve the highest reward within a trajectory, and does not need to optimize the expected cumulative return. In this work, we formulate an objective function that maximizes the expected maximum reward along a trajectory, derive a novel functional form of the Bellman equation, introduce the corresponding Bellman operators, and provide a proof of convergence. Using this formulation, we achieve state-of-the-art results on a molecule generation task that mimics a real-world drug discovery pipeline.
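To make the contrast between the two objectives concrete, the following is a minimal sketch on a toy deterministic MDP. It assumes that the max-reward backup takes the form Q(s, a) = max(r(s, a), γ · max over a' of Q(s', a')) for deterministic transitions and nonnegative rewards; the abstract does not state the operator, so this form, the toy transition table, and all names (`TRANSITIONS`, `sum_backup`, `max_backup`, `q_iteration`) are illustrative assumptions rather than the derivation given in the paper.

```python
# Toy deterministic MDP (hypothetical, for illustration only).
# TRANSITIONS[state][action] = (next_state, reward); state 3 is terminal.
TRANSITIONS = {
    0: {0: (1, 1.0), 1: (4, 0.0)},  # action 0: many small rewards; action 1: one large reward later
    1: {0: (2, 1.0)},
    2: {0: (3, 1.0)},
    4: {0: (3, 2.0)},
}
GAMMA = 0.9


def sum_backup(reward, next_value):
    """Standard backup: immediate reward plus discounted future return."""
    return reward + GAMMA * next_value


def max_backup(reward, next_value):
    """Assumed max-reward backup: the better of the reward now or the discounted best reward later."""
    return max(reward, GAMMA * next_value)


def q_iteration(backup, sweeps=50):
    """Tabular Q-iteration with a pluggable backup rule."""
    q = {s: {a: 0.0 for a in acts} for s, acts in TRANSITIONS.items()}
    for _ in range(sweeps):
        for s, acts in TRANSITIONS.items():
            for a, (s_next, r) in acts.items():
                # Terminal states have no entry in q and contribute a value of 0.
                next_value = max(q[s_next].values()) if s_next in q else 0.0
                q[s][a] = backup(r, next_value)
    return q


def greedy_action(q, state):
    """Return the action with the highest Q-value in the given state."""
    return max(q[state], key=q[state].get)


q_sum = q_iteration(sum_backup)
q_max = q_iteration(max_backup)
print("sum objective:", q_sum[0], "-> action", greedy_action(q_sum, 0))
print("max objective:", q_max[0], "-> action", greedy_action(q_max, 0))
```

In this toy example the cumulative-return objective prefers action 0 in state 0 (three rewards of 1), whereas the max-reward objective prefers action 1 (a single reward of 2). This is the kind of behavior the abstract motivates for molecule generation, where only the best state along a trajectory matters.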