Dynamic treatment regimes (DTRs) provide a principled framework for optimizing sequential decision-making in domains where decisions must adapt over time in response to individual trajectories, such as healthcare, education, and digital interventions. However, existing statistical methods often rely on strong positivity assumptions and lack robustness under partial data coverage, while offline reinforcement learning approaches typically focus on average training performance, lack statistical guarantees, and require solving complex optimization problems. To address these challenges, we propose POLAR, a novel pessimistic model-based policy learning algorithm for offline DTR optimization. POLAR estimates the transition dynamics from offline data and quantifies uncertainty for each history-action pair. A pessimistic penalty is then incorporated into the reward function to discourage actions with high uncertainty. Unlike many existing methods that focus on average training performance or provide guarantees only for an oracle policy, POLAR directly targets the suboptimality of the final learned policy and offers theoretical guarantees, without relying on computationally intensive minimax or constrained optimization procedures. To the best of our knowledge, POLAR is the first model-based DTR method to provide both statistical and computational guarantees, including finite-sample bounds on policy suboptimality. Empirical results on both synthetic data and the MIMIC-III dataset demonstrate that POLAR outperforms state-of-the-art methods and yields near-optimal, history-aware treatment strategies.
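To make the pessimism idea concrete, the following is a minimal tabular sketch rather than POLAR itself. It estimates transitions and mean rewards from offline tuples, subtracts an uncertainty penalty from the reward for each history-action pair, and plans greedily on the penalized model. The count-based uncertainty proxy `1 / sqrt(n)`, the penalty weight `c`, the tuple layout `(h, a, r, h_next)`, and the function names `fit_pessimistic_model` and `plan` are all illustrative assumptions; the abstract does not specify the estimator or the uncertainty quantifier.

```python
from collections import defaultdict
import math


def fit_pessimistic_model(transitions, c=1.0):
    """Fit a tabular model with penalized rewards from offline data.

    transitions: iterable of (h, a, r, h_next) tuples, where h is a hashable
    history summary and h_next is None after the final decision stage.
    Returns {(h, a): (p_hat, r_tilde)}.
    """
    next_counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for h, a, r, h_next in transitions:
        next_counts[(h, a)][h_next] += 1
        reward_sum[(h, a)] += r
        visits[(h, a)] += 1

    model = {}
    for key, n in visits.items():
        # Empirical transition probabilities for this history-action pair.
        p_hat = {h2: k / n for h2, k in next_counts[key].items()}
        # Assumed uncertainty proxy: shrinks as the pair is observed more often.
        uncertainty = 1.0 / math.sqrt(n)
        # Pessimistic reward: empirical mean reward minus the weighted penalty.
        r_tilde = reward_sum[key] / n - c * uncertainty
        model[key] = (p_hat, r_tilde)
    return model


def plan(model, h, actions):
    """Return (best_action, value) at history h under the penalized model.

    Assumes histories never repeat along a trajectory, so the recursion ends.
    """
    if h is None:
        return None, 0.0
    best_action, best_value = None, float("-inf")
    for a in actions:
        if (h, a) not in model:
            continue  # never observed in the data: not recommended
        p_hat, r_tilde = model[(h, a)]
        value = r_tilde + sum(
            p * plan(model, h2, actions)[1] for h2, p in p_hat.items()
        )
        if value > best_value:
            best_action, best_value = a, value
    if best_action is None:
        return None, 0.0  # no data at this history; value defaults to zero
    return best_action, best_value


# Toy single-stage data: action 0 is well covered, action 1 is seen once.
data = [("s0", 0, 1.0, None)] * 100 + [("s0", 1, 1.5, None)]
pessimistic_choice = plan(fit_pessimistic_model(data, c=1.0), "s0", [0, 1])
unpenalized_choice = plan(fit_pessimistic_model(data, c=0.0), "s0", [0, 1])
```

In this toy example the rarely observed action has the higher empirical reward, so the unpenalized planner selects it, while the penalized planner falls back to the well-covered action. This is the qualitative behavior the abstract attributes to the pessimistic penalty under partial data coverage.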