Controlling Costs: Feature Selection on a Budget

The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty, or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost-conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.
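The weighted false discovery proportion described above, and the idea of choosing a set along the path by budget, can be illustrated with a small sketch. This is not the cheap knockoffs procedure itself, whose details are not given here. It only computes the quantity the bound refers to, assuming the true set of important features is known (as in a simulation). The feature names, costs, and the helper names `weighted_fdp` and `largest_set_within_budget` are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def weighted_fdp(selected: Sequence[str],
                 costs: Dict[str, float],
                 important: Set[str]) -> float:
    """Fraction of the selected features' total cost spent on unimportant features."""
    chosen = set(selected)
    total_cost = sum(costs[f] for f in chosen)
    if total_cost == 0:
        # Convention assumed here: an empty (or zero-cost) selection wastes nothing.
        return 0.0
    wasted_cost = sum(costs[f] for f in chosen if f not in important)
    return wasted_cost / total_cost


def largest_set_within_budget(path: Sequence[Sequence[str]],
                              costs: Dict[str, float],
                              budget: float) -> List[str]:
    """Walk a path of nested, growing feature sets; return the last one that fits the budget."""
    best: List[str] = []
    for feature_set in path:
        if sum(costs[f] for f in feature_set) <= budget:
            best = list(feature_set)
        else:
            # Sets are assumed nested with nonnegative costs, so later sets cost at least as much.
            break
    return best


# Hypothetical measurement costs and ground truth (illustration only).
costs = {"age": 1.0, "blood_panel": 5.0, "mri": 20.0, "genome": 50.0}
important = {"age", "mri"}
path = [
    ["age"],
    ["age", "blood_panel"],
    ["age", "blood_panel", "mri"],
    ["age", "blood_panel", "mri", "genome"],
]

chosen = largest_set_within_budget(path, costs, budget=30.0)
print(chosen, weighted_fdp(chosen, costs, important))
```

In this toy example the one unimportant feature in the chosen set is cheap, so the wasted fraction of cost is smaller than the unweighted fraction of false selections (one in three) would suggest.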

