Offline reinforcement learning (ORL) has shown potential for improving decision-making in healthcare. However, contemporary research typically aggregates patient data into fixed time intervals to simplify the mapping onto standard ORL frameworks, and the impact of these temporal manipulations on model safety and efficacy remains poorly understood. In this work, using both a gridworld navigation task and the UVA/Padova clinical diabetes simulator, we demonstrate that temporal resampling significantly degrades the performance of ORL algorithms during live deployment. We propose three mechanisms that drive this failure: (i) the generation of counterfactual trajectories, (ii) the distortion of temporal expectations, and (iii) the compounding of generalisation errors. Crucially, we find that standard off-policy evaluation metrics can fail to detect these drops in performance. Our findings reveal a fundamental risk in current healthcare ORL pipelines and emphasise the need for methods that explicitly handle the irregular timing of clinical decision-making.
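To make the aggregation step concrete, the following minimal sketch (ours, not taken from the paper) shows how resampling irregularly timed clinical events onto a fixed grid can merge a state and an action recorded at different times into a single transition that never actually occurred, one plausible source of the counterfactual trajectories in mechanism (i). The column names, timestamps, and one-hour bin width are hypothetical.

```python
# Illustrative sketch: fixed-interval aggregation of irregularly
# timed clinical events. All names and values here are hypothetical.
import pandas as pd

# Irregularly timed events for one patient: a glucose reading at
# 08:05 and an insulin dose at 08:40 occurred 35 minutes apart.
events = pd.DataFrame(
    {
        "glucose_mgdl": [182.0, None, 95.0],
        "insulin_units": [None, 4.0, None],
    },
    index=pd.to_datetime(
        ["2024-01-01 08:05", "2024-01-01 08:40", "2024-01-01 09:55"]
    ),
)

# Resampling to a fixed one-hour grid collapses both events into the
# 08:00 bin, pairing a state and an action that were never observed
# together: a counterfactual (state, action) pair in the ORL dataset.
hourly = events.resample("1h").mean()
print(hourly)
```

Under this resampling, the 08:00 row pairs the 182 mg/dL reading with the 4-unit dose as if they were simultaneous, while the 35-minute gap between them, and the irregular spacing of the events generally, is discarded.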