Gain Confidence, Reduce Disappointment: A New Approach to Cross-Validation for Sparse Regression

Ridge-regularized sparse regression involves selecting a subset of features that explains the relationship between a design matrix and an output vector in an interpretable manner. To select the sparsity and robustness of linear regressors, techniques like leave-one-out cross-validation are commonly used for hyperparameter tuning. However, cross-validation typically increases the cost of sparse regression by several orders of magnitude. Additionally, validation metrics are noisy estimators of the test-set error, and the amount of noise varies across hyperparameter combinations. Optimizing over these metrics is therefore vulnerable to out-of-sample disappointment, especially in underdetermined settings. To address this, we make two contributions. First, we leverage the generalization theory literature to propose confidence-adjusted variants of the leave-one-out error that are less prone to out-of-sample disappointment. Second, we leverage ideas from the mixed-integer optimization literature to obtain computationally tractable relaxations of the confidence-adjusted leave-one-out error, which allows us to minimize it while solving fewer mixed-integer optimization problems (MIOs). Our relaxations give rise to an efficient coordinate descent scheme that obtains significantly lower leave-one-out errors than other methods in the literature. We validate our theory by demonstrating that we obtain solutions that are significantly sparser than, and comparably accurate to, those of popular methods like GLMNet, and that suffer from less out-of-sample disappointment. On synthetic datasets, our confidence adjustment procedure generates significantly fewer false discoveries and improves out-of-sample performance by 2-5% compared to cross-validating without confidence adjustment. Across a suite of 13 real datasets, a calibrated version of our procedure reduces the test-set error by an average of 4% compared to cross-validating without confidence adjustment.
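To make the leave-one-out quantity discussed above concrete, the following is a minimal sketch of the plain (not confidence-adjusted) leave-one-out error of ridge regression on a fixed support. It is an illustration only, not the procedure proposed in this work: the function names, the penalty parametrization `lam * ||beta||^2`, the absence of an intercept, and the fixed `support` are all assumptions made for the example. It uses the standard closed-form identity for ridge regression and checks it against a brute-force refit.

```python
import numpy as np


def loo_error_ridge(X, y, support, lam):
    """Mean squared leave-one-out error of ridge regression on the columns
    in `support`, computed without refitting (hypothetical helper)."""
    Xs = X[:, support]
    k = Xs.shape[1]
    # Ridge hat matrix H = Xs (Xs^T Xs + lam I)^{-1} Xs^T
    gram = Xs.T @ Xs + lam * np.eye(k)
    H = Xs @ np.linalg.solve(gram, Xs.T)
    residuals = y - H @ y
    # Closed-form identity: leave-one-out residual_i = residual_i / (1 - H_ii)
    loo_residuals = residuals / (1.0 - np.diag(H))
    return float(np.mean(loo_residuals ** 2))


def loo_error_brute_force(X, y, support, lam):
    """Same quantity, obtained by refitting n times (for checking only)."""
    Xs = X[:, support]
    n, k = Xs.shape
    errors = []
    for i in range(n):
        keep = np.arange(n) != i  # drop observation i
        A = Xs[keep]
        beta = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ y[keep])
        errors.append((y[i] - Xs[i] @ beta) ** 2)
    return float(np.mean(errors))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 8))
    y = X[:, [0, 3]] @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=30)
    print(loo_error_ridge(X, y, [0, 3], lam=1.0))
    print(loo_error_brute_force(X, y, [0, 3], lam=1.0))
```

The shortcut applies here only because the support is held fixed. In sparse regression the support would in general be re-selected in each fold, which is where the cost of cross-validation comes from; the sketch does not capture that step, nor the confidence adjustment or the relaxations.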
