The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty, or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost-conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.
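To make the weighted false discovery proportion concrete, the following minimal sketch computes the fraction of selected feature cost that is spent on unimportant features. It is an illustration only, not the cheap knockoffs procedure itself: the function name `weighted_fdp`, the feature names, the costs, and the convention of returning 0 for an empty selection are all assumptions introduced here.

```python
def weighted_fdp(selected, costs, important):
    """Fraction of the selected features' total cost spent on unimportant features.

    selected  : iterable of selected feature names
    costs     : dict mapping feature name -> nonnegative cost
    important : set of truly important feature names (known only in simulation)
    """
    selected = list(selected)
    total_cost = sum(costs[j] for j in selected)
    if total_cost == 0:
        # Assumed convention: nothing was spent, so nothing was wasted.
        return 0.0
    wasted_cost = sum(costs[j] for j in selected if j not in important)
    return wasted_cost / total_cost


# Hypothetical example: four candidate measurements with unequal costs.
costs = {"age": 1.0, "questionnaire": 2.0, "blood_test": 5.0, "mri": 20.0}
important = {"age", "mri"}
selected = ["age", "blood_test", "mri"]

wfdp = weighted_fdp(selected, costs, important)
# Unweighted proportion of selected features that are unimportant, for comparison.
fdp = sum(j not in important for j in selected) / len(selected)
print(f"weighted FDP = {wfdp:.3f}, unweighted FDP = {fdp:.3f}")
```

In this hypothetical selection one of three features is unimportant, but it accounts for only 5 of the 26 cost units spent, so the weighted proportion is smaller than the unweighted one. Had the expensive feature been the unimportant one, the weighted proportion would have been much larger, which is the asymmetry the cost-conscious procedure is designed to control.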