A key challenge in machine learning is to design interpretable models that can reduce their inputs to the best subset for making transparent predictions, especially in the clinical domain. In this work, we propose a certifiably optimal feature selection procedure for logistic regression, formulated from a mixed-integer conic optimization perspective, that can account for an auxiliary cost of obtaining features. Based on an extensive review of the literature, we carefully create a synthetic dataset generator for clinical prognostic model research. This allows us to systematically evaluate different heuristic and optimal feature selection procedures under cardinality and budget constraints. The analysis shows key limitations of the methods in the low-data regime and when confronted with label noise. Our paper not only provides empirical recommendations for suitable methods and dataset designs, but also paves the way for future research in the area of meta-learning.
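To make the problem class concrete, the following is a minimal sketch of a cardinality- or budget-constrained logistic regression written as a mixed-integer program. The notation is assumed for illustration and is not taken from the paper: labels $y_i \in \{-1, +1\}$, feature vectors $x_i \in \mathbb{R}^p$, binary selection indicators $z_j$, a big-M bound $M$, a ridge weight $\lambda \ge 0$, a cardinality limit $k$, per-feature acquisition costs $c_j$, and a budget $B$. The paper's actual formulation may differ, for example in how the indicators are linked to the coefficients.

```latex
\begin{aligned}
\min_{\beta_0,\,\beta,\,z}\quad
  & \sum_{i=1}^{n} \log\!\bigl(1 + \exp\bigl(-y_i(\beta_0 + x_i^\top \beta)\bigr)\bigr)
    + \lambda \lVert \beta \rVert_2^2 \\
\text{s.t.}\quad
  & -M z_j \le \beta_j \le M z_j, \qquad j = 1, \dots, p, \\
  & \sum_{j=1}^{p} z_j \le k \quad \text{(cardinality)}
    \qquad \text{or} \qquad
    \sum_{j=1}^{p} c_j z_j \le B \quad \text{(budget)}, \\
  & z \in \{0, 1\}^p .
\end{aligned}
```

Setting $c_j = 1$ for all $j$ and $B = k$ recovers the cardinality constraint as a special case of the budget constraint. One standard route to a conic form replaces each loss term by an epigraph variable $t_i$ with $\exp(u_i - t_i) + \exp(-t_i) \le 1$, where $u_i = -y_i(\beta_0 + x_i^\top \beta)$; this is equivalent to $t_i \ge \log(1 + \exp(u_i))$ and can be expressed with exponential cones.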