There is increasing interest in data-driven approaches for recommending optimal treatment strategies in chronic disease management and critical care applications. Reinforcement learning (RL) methods are well suited to this sequential decision-making problem, but they must be trained and evaluated exclusively on retrospective medical record datasets because direct online exploration is unsafe and infeasible. Despite this requirement, the vast majority of treatment optimization studies use off-policy RL methods (e.g., Double Deep Q Networks (DDQN) or its variants) that are known to perform poorly in purely offline settings. Recent advances in offline RL, such as Conservative Q-Learning (CQL), offer a suitable alternative. However, challenges remain in adapting these approaches to real-world applications where suboptimal examples dominate the retrospective dataset and strict safety constraints must be satisfied. In this work, we introduce a practical and theoretically grounded transition sampling approach to address action imbalance during offline RL training. We perform extensive experiments on two real-world tasks, diabetes and sepsis treatment optimization, to compare the performance of the proposed approach against prominent off-policy and offline RL baselines (DDQN and CQL). Across a range of principled and clinically relevant metrics, we show that the proposed approach enables substantial improvements in expected health outcomes while remaining in accordance with relevant practice and safety guidelines.