We consider the problem of best subset selection in linear regression, where the aim is to find a fixed-size subset of features that best fits the response. This is particularly challenging when the total number of available features is very large compared to the number of data samples. Existing optimal methods for solving this problem tend to be slow, while fast methods tend to have low accuracy. Ideally, a new method would perform best subset selection faster than existing optimal methods with comparable accuracy, or more accurately than methods of comparable computational speed. Here, we propose a novel continuous optimization method that identifies a subset solution path: a small set of models of varying size that serve as candidates for the single best subset of features, which is optimal in a specific sense in linear regression. Our method turns out to be fast, making best subset selection possible when the number of features is well in excess of thousands. Because of the method's outstanding overall performance, framing the best subset selection challenge as a continuous optimization problem opens new research directions for feature extraction for a large variety of regression models.
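To make the problem statement concrete, the following is a minimal sketch of exhaustive best subset selection, the kind of optimal but slow baseline the abstract contrasts against. It is not the continuous optimization method proposed here. The function names are hypothetical, the fit criterion is assumed to be the least-squares residual sum of squares without an intercept, and the search enumerates every subset of size k, so it is only feasible for a handful of features.

```python
from itertools import combinations


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: selected features are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for j in range(col, n + 1):
                m[row][j] -= factor * m[col][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def residual_sum_of_squares(X, y, subset):
    """RSS of the least-squares fit of y on the columns of X listed in subset."""
    cols = list(subset)
    # Normal equations restricted to the chosen columns: (Xs^T Xs) beta = Xs^T y.
    gram = [[sum(row[p] * row[q] for row in X) for q in cols] for p in cols]
    rhs = [sum(row[p] * yi for row, yi in zip(X, y)) for p in cols]
    beta = solve_linear_system(gram, rhs)
    return sum(
        (yi - sum(b * row[c] for b, c in zip(beta, cols))) ** 2
        for row, yi in zip(X, y)
    )


def best_subset_exhaustive(X, y, k):
    """Return the size-k tuple of column indices with the smallest RSS."""
    p = len(X[0])
    return min(
        combinations(range(p), k),
        key=lambda s: residual_sum_of_squares(X, y, s),
    )
```

The number of candidate subsets grows combinatorially with the number of features, which is why a search of this kind becomes impractical in the regime of thousands of features that the abstract targets.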