While achieving high prediction accuracy is a fundamental goal in machine learning, an equally important task is finding a small number of features with high explanatory power. One popular selection technique is permutation importance, which assesses a variable's impact by measuring the change in prediction error after permuting its values. However, this approach can be problematic because it relies on creating artificial data, a drawback shared by several other methods. A further limitation is that many variable selection methods are model-specific. We introduce a new model-independent approach, Variable Priority (VarPro), which works by utilizing rules, without requiring the generation of artificial data or the evaluation of prediction error. The method is relatively easy to use, requiring only the calculation of sample averages of simple statistics, and applies to many data settings, including regression, classification, and survival analysis. We investigate the asymptotic properties of VarPro and show, among other things, that VarPro has a consistent filtering property for noise variables. Empirical studies on synthetic and real-world data show that the method achieves balanced performance and compares favorably with many state-of-the-art procedures currently used for variable selection.
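To make the baseline concrete, here is a minimal sketch of the permutation-importance procedure described above: permute one feature at a time and record the resulting increase in prediction error. This is an illustrative implementation, not the authors' code; the `model` callable, the mean-squared-error loss, and all names are assumptions for the example.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Importance of feature j = average increase in prediction error
    after randomly permuting column j. The shuffled column is the
    'artificial data' the abstract refers to."""
    rng = random.Random(seed)
    baseline = mse(y, [model(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        deltas = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between feature j and y
            X_perm = [row[:j] + [col[i]] + row[j + 1:]
                      for i, row in enumerate(X)]
            deltas.append(mse(y, [model(row) for row in X_perm]) - baseline)
        importances.append(sum(deltas) / n_repeats)
    return importances

# Toy data: y depends only on feature 0; feature 1 is pure noise.
rng = random.Random(1)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [2.0 * row[0] for row in X]
model = lambda row: 2.0 * row[0]  # stand-in for a fitted model

imp = permutation_importance(model, X, y)
```

On this toy problem the signal feature receives a large positive importance while the ignored noise feature scores zero, illustrating the error-based ranking that VarPro avoids by instead averaging simple statistics over rules.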