Interpretable policy learning seeks to estimate intelligible decision policies from observed actions; however, existing models fall short by forcing a tradeoff between accuracy and interpretability. This tradeoff limits data-driven interpretations of human decision-making processes. For example, to audit medical decisions for biases and suboptimal practices, we require models of decision processes that provide concise descriptions of complex behaviors. Fundamentally, existing approaches are burdened by this tradeoff because they represent the underlying decision process as a single universal policy, when in fact human decisions are dynamic and can change drastically with contextual information. Thus, we propose Contextualized Policy Recovery (CPR), which reframes the problem of modeling complex decision processes as a multi-task learning problem in which complex decision policies are composed of context-specific policies. CPR models each context-specific policy as a linear observation-to-action mapping, and generates new decision models $\textit{on-demand}$ as contexts are updated with new observations. CPR is compatible with fully offline and partially observable decision environments, and can be tailored to incorporate any recurrent black-box model or interpretable decision model. We assess CPR through studies on simulated and real data, achieving state-of-the-art performance on the canonical tasks of predicting antibiotic prescription in intensive care units ($+22\%$ AUROC vs. previous SOTA) and predicting MRI prescription for Alzheimer's patients ($+7.7\%$ AUROC vs. previous SOTA). With this improvement in predictive performance, CPR closes the accuracy gap between interpretable and black-box methods for policy learning, allowing high-resolution exploration and analysis of context-specific decision models.
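The idea of generating a linear observation-to-action mapping from a recurrently updated context can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not taken from the paper: the plain tanh recurrent encoder, the binary action with a logistic link, the choice to build the context from past observation-action pairs, the randomly initialized (untrained) weights, and the names `contextual_policies`, `W_h`, `W_in`, `W_theta`, and `b_theta`.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    # Matrix-vector product on plain Python lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def random_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def contextual_policies(observations, actions, W_h, W_in, W_theta, b_theta):
    """Return P(a_t = 1) and the linear policy theta_t used at every step.

    theta_t is generated from a context that summarizes only the history
    before step t, then applied linearly to the current observation x_t.
    """
    hidden = [0.0] * len(W_h)  # empty context before any history is seen
    probs, thetas = [], []
    for x_t, a_t in zip(observations, actions):
        # Context -> coefficients plus intercept of this step's linear policy.
        theta_t = [w + b for w, b in zip(matvec(W_theta, hidden), b_theta)]
        logit = sum(c * x for c, x in zip(theta_t[:-1], x_t)) + theta_t[-1]
        probs.append(sigmoid(logit))
        thetas.append(theta_t)
        # Update the context with the newly observed (x_t, a_t) pair.
        recurrent = matvec(W_h, hidden)
        driven = matvec(W_in, list(x_t) + [a_t])
        hidden = [math.tanh(r + u) for r, u in zip(recurrent, driven)]
    return probs, thetas

# Hypothetical toy setup: 3 observed features, 4 context units, 5 time steps.
rng = random.Random(0)
d, h, T = 3, 4, 5
W_h = random_matrix(h, h, rng)
W_in = random_matrix(h, d + 1, rng)
W_theta = random_matrix(d + 1, h, rng)
b_theta = [rng.uniform(-0.5, 0.5) for _ in range(d + 1)]
observations = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
actions = [rng.randint(0, 1) for _ in range(T)]

probs, thetas = contextual_policies(observations, actions, W_h, W_in, W_theta, b_theta)
```

In this sketch each row of `thetas` can be read like a set of logistic-regression coefficients for one time step, which is the sense in which a context-specific policy stays inspectable even though the encoder that produces it is a black box. A trained model would fit the weights to observed actions rather than sample them at random.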