We consider the problem of learning classification trees that are robust to distribution shifts between training and testing/deployment data. This problem arises frequently in high-stakes settings such as public health and social work, where data is often collected through self-reported surveys. Such surveys are highly sensitive to, for example, the framing of the questions, the time and place at which the survey is conducted, and how comfortable the interviewee is sharing information with the interviewer. We propose a method for learning optimal robust classification trees based on mixed-integer robust optimization technology. In particular, we demonstrate that the problem of learning an optimal robust tree can be cast as a single-stage mixed-integer robust optimization problem with a highly nonlinear and discontinuous objective. We reformulate this problem equivalently as a two-stage linear robust optimization problem, for which we devise a tailored solution procedure based on constraint generation. We evaluate the performance of our approach on numerous publicly available datasets and compare it to that of a regularized, non-robust optimal tree. Across several datasets and distribution shifts, our robust solution improves worst-case accuracy by up to 12.48% and average-case accuracy by up to 4.85% relative to the non-robust one.
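To make the notion of worst-case accuracy concrete, the following is a minimal sketch of how one might evaluate a fixed tree against a bounded shift. It is not the learning procedure proposed here. It rests on several assumptions that do not come from the text above: features are integer-valued, the adversary may move feature values at unit cost per unit of change, a single budget is shared across all samples, and the tuple-based tree encoding and all function names are hypothetical.

```python
import math

# Hypothetical tree encoding (an assumption for illustration):
#   ("leaf", label)
#   ("split", feature_index, threshold, left_subtree, right_subtree)
# A sample goes left when x[feature_index] <= threshold, right otherwise.

def leaf_boxes(node, box=None):
    """Yield (label, box) per leaf; box maps feature -> (lo, hi) integer bounds."""
    box = box or {}
    if node[0] == "leaf":
        yield node[1], box
        return
    _, f, theta, left, right = node
    lo, hi = box.get(f, (-math.inf, math.inf))
    yield from leaf_boxes(left, {**box, f: (lo, min(hi, theta))})
    yield from leaf_boxes(right, {**box, f: (max(lo, theta + 1), hi)})

def min_flip_cost(tree, x, y):
    """Smallest L1 perturbation of x that routes it to a leaf whose label != y."""
    best = math.inf
    for label, box in leaf_boxes(tree):
        if label == y:
            continue
        if any(lo > hi for lo, hi in box.values()):
            continue  # contradictory splits: this leaf is unreachable
        cost = sum(max(lo - x[f], 0, x[f] - hi) for f, (lo, hi) in box.items())
        best = min(best, cost)
    return best

def worst_case_accuracy(tree, X, Y, budget):
    """Accuracy after an adversary spends a shared L1 budget to flip samples."""
    # Every flip removes exactly one correct prediction, so flipping the
    # cheapest samples first maximizes the number of misclassifications.
    costs = sorted(min_flip_cost(tree, x, y) for x, y in zip(X, Y))
    flipped = 0
    for c in costs:
        if c > budget:
            break
        budget -= c
        flipped += 1
    return (len(X) - flipped) / len(X)

# Toy data: one integer feature, split at 2.
tree = ("split", 0, 2, ("leaf", 0), ("leaf", 1))
X = [[0], [2], [3], [5]]
Y = [0, 0, 1, 1]
for eps in (0, 2, 5):
    print(eps, worst_case_accuracy(tree, X, Y, eps))
```

With a budget of 0 the sketch returns the nominal accuracy; larger budgets let the adversary push the samples closest to a split across it first, so worst-case accuracy decreases as the assumed shift grows.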