Policy Optimization for Personalized Interventions in Behavioral Health

Problem definition: Behavioral health interventions delivered through digital platforms have the potential to significantly improve health outcomes through education, motivation, reminders, and outreach. We study the problem of optimizing personalized interventions for patients to maximize a long-term outcome in a setting where interventions are costly and capacity-constrained. Methodology/results: This paper provides a model-free approach to solving this problem. We find that generic model-free approaches from the reinforcement learning literature are too data-intensive for healthcare applications, while simpler bandit approaches make progress at the expense of ignoring long-term patient dynamics. We present a new algorithm, which we call DecompPI, that approximates one step of policy iteration. Implementing DecompPI consists simply of a prediction task on offline data, alleviating the need for online experimentation. Theoretically, we show that under a natural set of structural assumptions on patient dynamics, DecompPI surprisingly recovers at least 1/2 of the improvement possible between a naive baseline policy and the optimal policy. At the same time, DecompPI is both robust to estimation errors and interpretable. Through an empirical case study on a mobile health platform for improving treatment adherence for tuberculosis, we find that DecompPI can provide the same efficacy as the status quo with approximately half the intervention capacity. Managerial implications: DecompPI is general and easily implementable for organizations aiming to improve long-term behavior through targeted interventions. Our case study suggests that the platform's costs of deploying interventions can potentially be cut by 50%, which makes it easier to scale up the system cost-efficiently.
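
To make the idea of a single policy-improvement step under a capacity constraint concrete, the following is a minimal sketch and not the paper's DecompPI algorithm. It assumes a small discrete state space, binary interventions, and logged data in which the action was effectively randomized given the state, so that a difference in mean long-term outcomes is a reasonable estimate of the gain from intervening once and then following the baseline. The function names `estimate_intervention_gain` and `allocate`, and the toy data, are hypothetical.

```python
import numpy as np


def estimate_intervention_gain(states, actions, returns, n_states):
    """Estimate, per state, the gain in long-term outcome from intervening now.

    Assumption (not from the source): logged actions are unconfounded given
    the state, so a difference in means is a usable estimate.
    """
    states = np.asarray(states)
    actions = np.asarray(actions)
    returns = np.asarray(returns, dtype=float)
    gain = np.zeros(n_states)
    for s in range(n_states):
        treated = returns[(states == s) & (actions == 1)]
        control = returns[(states == s) & (actions == 0)]
        # States lacking either arm in the logs keep a gain of zero.
        if len(treated) > 0 and len(control) > 0:
            gain[s] = treated.mean() - control.mean()
    return gain


def allocate(current_states, gain, capacity):
    """Pick at most `capacity` patients with the largest positive estimated gain."""
    scores = gain[np.asarray(current_states)]
    order = np.argsort(-scores, kind="stable")  # highest gain first
    return sorted(int(i) for i in order[:capacity] if scores[i] > 0)


# Toy offline log: (state, action, observed long-term outcome).
logged_states = [0, 0, 1, 1, 2, 2]
logged_actions = [1, 0, 1, 0, 1, 0]
logged_returns = [5.0, 3.0, 4.0, 4.0, 9.0, 2.0]

gain = estimate_intervention_gain(logged_states, logged_actions, logged_returns, 3)
todays_patients = [0, 1, 2, 2]  # current state of each patient
chosen = allocate(todays_patients, gain, capacity=2)
print(gain, chosen)
```

In this sketch the only learning step is a prediction from offline data, and the capacity constraint is handled by ranking patients on the predicted gain. A real deployment would replace the per-state averages with a fitted predictive model and would need to address confounding in the logged data.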
