Variable selection in linear regression models: choosing the best subset is not always the best choice

Variable selection in linear regression is a much-discussed problem. Best subset selection (BSS) is often considered the intuitive 'gold standard', with its use restricted only by its NP-hard nature. Alternatives such as the least absolute shrinkage and selection operator (Lasso) or the elastic net (Enet) have become methods of choice in high-dimensional settings. A recent proposal formulates BSS as a mixed integer optimization problem, so that much larger problems have become feasible in reasonable computation time. We present an extensive neutral comparison of the variable selection performance, in linear regression, of BSS against forward stepwise selection (FSS), Lasso and Enet. The simulation study considers a wide range of settings that are challenging with regard to dimensionality (with respect to the number of observations and variables), signal-to-noise ratio and correlation between predictors. As the main measure of performance, we used the best possible F1-score for each method to ensure a fair comparison irrespective of any criterion for choosing the tuning parameters; the results were confirmed by alternative performance measures. Somewhat surprisingly, even in low-dimensional settings, BSS reliably outperformed the other methods only when the signal-to-noise ratio was high and the variables were (nearly) uncorrelated. Further, the performance of FSS was nearly identical to that of BSS. Our results shed new light on the usual presumption that BSS is, in principle, the best choice for variable selection. Especially for correlated variables, alternatives like Enet are faster and appear to perform better in practical settings.
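The "best possible F1-score" used as the main performance measure can be illustrated with a minimal sketch. It assumes that a method yields a sequence of selected variable sets, one per tuning-parameter value (for example, one per subset size for BSS or FSS, or one per penalty value for Lasso), and that the true support is known, as it is in a simulation. The function names, the index sets and the example path are hypothetical and are not taken from the study.

```python
from typing import Iterable, Set


def f1_score(selected: Set[int], truth: Set[int]) -> float:
    """F1-score of a selected variable set against the true support."""
    true_positives = len(selected & truth)
    if true_positives == 0:
        # Covers the empty selection and selections with no true variable.
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def best_f1(path: Iterable[Set[int]], truth: Set[int]) -> float:
    """Largest F1-score over all models on a tuning-parameter path.

    Taking the maximum removes the dependence on any particular rule
    (cross-validation, information criteria, ...) for picking one model.
    """
    return max((f1_score(selected, truth) for selected in path), default=0.0)


# Hypothetical example: variables 0, 1 and 2 are the true signals,
# and a method selects one additional variable at each step.
truth = {0, 1, 2}
path = [{0}, {0, 1}, {0, 1, 5}, {0, 1, 2, 5}, {0, 1, 2, 5, 7}]
result = best_f1(path, truth)  # attained by {0, 1, 2, 5}: 6/7
```

In this example the best model on the path contains one false positive, so the best possible F1-score is below 1 even though every true variable is eventually selected. Comparing methods on this quantity measures how good their candidate models can be, not how well a tuning rule picks among them.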
