Interpretable policy learning seeks to estimate intelligible decision policies from observed actions; however, existing models force a tradeoff between accuracy and interpretability. This tradeoff limits data-driven interpretations of human decision-making processes. For example, to audit medical decisions for biases and suboptimal practices, we require models of decision processes that provide concise descriptions of complex behaviors. Fundamentally, existing approaches are burdened by this tradeoff because they represent the underlying decision process as a universal policy, when in fact human decisions are dynamic and can change drastically with contextual information. We therefore propose Contextualized Policy Recovery (CPR), which reframes the problem of modeling complex decision processes as a multi-task learning problem in which complex decision policies are composed of context-specific policies. CPR models each context-specific policy as a linear observation-to-action mapping, and generates new decision models $\textit{on demand}$ as contexts are updated with new observations. CPR is compatible with fully offline and partially observable decision environments, and can be tailored to incorporate any recurrent black-box model or interpretable decision model. We assess CPR through studies on simulated and real data, achieving state-of-the-art performance on the canonical tasks of predicting antibiotic prescription in intensive care units ($+22\%$ AUROC vs. previous SOTA) and predicting MRI prescription for Alzheimer's patients ($+7.7\%$ AUROC vs. previous SOTA). With this improvement in predictive performance, CPR closes the accuracy gap between interpretable and black-box methods for policy learning, allowing high-resolution exploration and analysis of context-specific decision models.
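To make the idea of generating context-specific linear policies on demand more concrete, here is a minimal, untrained sketch in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the class name `ContextualizedPolicySketch`, the plain tanh RNN used as the recurrent context encoder, the logistic link for a binary action, and the choice to define the context at step $t$ as the history of previous observations and actions are all hypothetical. The recurrent state is mapped to the weights and bias of a linear observation-to-action model, which is then applied to the current observation; the per-step weights are the interpretable, context-specific policies.

```python
import numpy as np


class ContextualizedPolicySketch:
    """Hypothetical sketch: a recurrent context encoder whose hidden state
    is mapped to the coefficients of a linear (logistic) policy.
    Weights are random; no training loop is shown."""

    def __init__(self, obs_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Plain tanh RNN as the context encoder (assumption: any recurrent
        # black-box model could take its place).
        self.W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        # RNN input = previous observation concatenated with previous action.
        self.W_x = rng.normal(scale=0.1, size=(hidden_dim, obs_dim + 1))
        # Head mapping the hidden context to (weights, bias) of the linear policy.
        self.W_out = rng.normal(scale=0.1, size=(obs_dim + 1, hidden_dim))

    def policy_params(self, h):
        """Return the context-specific linear policy (weights, bias)."""
        theta = self.W_out @ h
        return theta[:-1], theta[-1]

    def action_probs(self, observations, actions):
        """Return P(a_t = 1 | x_t, context_t) and the policy weights per step.

        Assumption: context_t summarizes (x_1, a_1, ..., x_{t-1}, a_{t-1}),
        and the current observation x_t enters only the linear policy."""
        h = np.zeros(self.W_h.shape[0])
        probs, coefs = [], []
        for x_t, a_t in zip(observations, actions):
            w, b = self.policy_params(h)  # linear policy generated on demand
            logit = w @ x_t + b
            probs.append(1.0 / (1.0 + np.exp(-logit)))
            coefs.append(w)
            # Update the context with the newly observed (x_t, a_t) pair.
            h = np.tanh(self.W_h @ h + self.W_x @ np.append(x_t, a_t))
        return np.array(probs), np.array(coefs)


rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))           # 5 time steps, 3 observed features
A = rng.integers(0, 2, size=5)        # binary actions (e.g. prescribe or not)
model = ContextualizedPolicySketch(obs_dim=3, hidden_dim=8)
p, W = model.action_probs(X, A)
print(p.shape, W.shape)  # (5,) (5, 3)
```

Each row of `W` is a separate linear policy tied to one context, so an analyst could inspect how feature weights shift as the history grows. Because the hidden state starts at zero and the sketch has no learned output bias, the first-step policy is trivially zero; a trained model would learn these parameters from data.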