Deep Offline Reinforcement Learning for Real-World Treatment Optimization Applications

There is increasing interest in data-driven approaches for dynamically choosing optimal treatment strategies in many chronic disease management and critical care applications. Reinforcement learning (RL) methods are well suited to this sequential decision-making problem, but they must be trained and evaluated exclusively on retrospective medical record datasets, because direct online exploration is unsafe and infeasible. Despite this requirement, the vast majority of dynamic treatment optimization studies use off-policy RL methods (e.g., Double Deep Q-Network (DDQN) or its variants) that are known to perform poorly in purely offline settings. Recent advances in offline RL, such as Conservative Q-Learning (CQL), offer a suitable alternative. However, challenges remain in adapting these approaches to real-world applications where suboptimal examples dominate the retrospective dataset and strict safety constraints must be satisfied. In this work, we introduce a practical transition sampling approach to address action imbalance during offline RL training, and an intuitive heuristic to enforce hard constraints during policy execution. We provide theoretical analyses showing that our proposed approach improves over CQL. We perform extensive experiments on two real-world tasks, diabetes and sepsis treatment optimization, comparing the performance of the proposed approach against prominent off-policy and offline RL baselines (DDQN and CQL). Across a range of principled and clinically relevant metrics, we show that our proposed approach enables substantial improvements in expected health outcomes and in consistency with relevant practice and safety guidelines.
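
The abstract names two components without giving their details: a transition sampling scheme for action imbalance and a heuristic for hard constraints at execution time. The sketch below illustrates one plausible reading of each, and it is an assumption rather than the authors' actual method. It assumes discrete actions, inverse-frequency sampling weights, and a boolean mask of permitted actions; the function names `action_balanced_sample` and `constrained_greedy_action` are hypothetical.

```python
import numpy as np


def action_balanced_sample(actions, batch_size, rng):
    """Draw transition indices so that each distinct action is equally likely.

    Assumption: imbalance is handled by weighting every transition by the
    inverse frequency of its logged action. The paper may use another scheme.
    """
    actions = np.asarray(actions).ravel()
    # counts[inverse] gives, for each transition, how often its action occurs.
    _, inverse, counts = np.unique(actions, return_inverse=True, return_counts=True)
    weights = 1.0 / counts[inverse]
    probs = weights / weights.sum()
    return rng.choice(len(actions), size=batch_size, replace=True, p=probs)


def constrained_greedy_action(q_values, allowed_mask):
    """Pick the highest-valued action among those the constraints permit.

    Assumption: hard constraints are expressed as a per-state boolean mask
    over actions, applied only at policy execution time.
    """
    q = np.asarray(q_values, dtype=float)
    mask = np.asarray(allowed_mask, dtype=bool)
    if not mask.any():
        raise ValueError("no action satisfies the constraints")
    # Disallowed actions get -inf so argmax can never select them.
    return int(np.argmax(np.where(mask, q, -np.inf)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logged_actions = [0] * 90 + [1] * 10  # action 1 is rare in the logged data
    batch = action_balanced_sample(logged_actions, batch_size=32, rng=rng)
    print("sampled indices:", batch)
    print("chosen action:", constrained_greedy_action([1.0, 5.0, 3.0], [True, False, True]))
```

In this sketch the sampler only changes which logged transitions reach each training batch, so it could sit in front of a CQL update without altering the loss. The mask leaves the learned Q-values untouched and filters actions only when the policy is executed.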
