Modern regression problems often involve high-dimensional data, and careful tuning of the regularization hyperparameters is crucial: it avoids overly complex models that overfit the training data while preserving desirable properties such as effective variable selection. We study the recently introduced problem of tuning regularization hyperparameters in linear regression across multiple related tasks. We obtain distribution-dependent bounds on the generalization error of the validation loss when tuning the L1 and L2 coefficients, covering ridge, lasso, and the elastic net. In contrast, prior work develops bounds that apply uniformly to all distributions, but such bounds necessarily degrade with the feature dimension d. While those bounds are shown to be tight for worst-case distributions, our bounds improve with the "niceness" of the data distribution. Concretely, we show that under the additional assumption that instances within each task are i.i.d. draws from broad, well-studied classes of distributions, including sub-Gaussians, our generalization bounds do not get worse with increasing d and are much sharper than those of prior work for very large d. We also extend our results to a generalization of ridge regression, where we achieve tighter bounds that take into account an estimate of the mean of the ground truth distribution.
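To make the setting concrete, the following is a minimal sketch of tuning a single ridge (L2) coefficient across several related tasks by minimizing the average validation loss. It illustrates only the tuning setup described in the abstract, not the generalization bounds and not necessarily the exact formulation used in the paper. The function names (`ridge_fit`, `validation_loss`, `tune_across_tasks`, `make_task`), the unnormalized ridge objective, the Gaussian synthetic data, and the grid of candidate values are all assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge estimator for the (assumed) objective
    ||y - X b||_2^2 + lam * ||b||_2^2, with lam > 0."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def validation_loss(task, lam):
    """Mean squared error on a task's validation split after fitting
    ridge regression on its training split."""
    X_tr, y_tr, X_val, y_val = task
    beta = ridge_fit(X_tr, y_tr, lam)
    return float(np.mean((X_val @ beta - y_val) ** 2))


def tune_across_tasks(tasks, grid):
    """Return the grid value with the lowest validation loss averaged
    over all tasks, together with the per-value averages."""
    avg = [float(np.mean([validation_loss(t, lam) for t in tasks])) for lam in grid]
    best = int(np.argmin(avg))
    return grid[best], avg


def make_task(rng, d=50, m=30, noise=0.5):
    """Hypothetical synthetic task: Gaussian (hence sub-Gaussian) features
    and a task-specific linear ground truth with additive noise."""
    beta = rng.normal(size=d) / np.sqrt(d)
    X_tr = rng.normal(size=(m, d))
    X_val = rng.normal(size=(m, d))
    y_tr = X_tr @ beta + noise * rng.normal(size=m)
    y_val = X_val @ beta + noise * rng.normal(size=m)
    return X_tr, y_tr, X_val, y_val


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tasks = [make_task(rng) for _ in range(20)]
    grid = [0.01, 0.1, 1.0, 10.0, 100.0]
    best_lam, avg_losses = tune_across_tasks(tasks, grid)
    print("selected lambda:", best_lam)
    for lam, loss in zip(grid, avg_losses):
        print(f"lambda={lam:g}  average validation loss={loss:.4f}")
```

The generalization question studied in the abstract is how far the average validation loss of the selected coefficient on the sampled tasks can be from its expected validation loss on a new task drawn from the same task distribution. Extending the sketch to the lasso or the elastic net would require an iterative solver in place of the closed-form `ridge_fit`.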