When training large models on limited data, avoiding overfitting is paramount. Grid search and smarter search methods require an expensive separate training run for each candidate hyperparameter setting, and they carve out a validation set, which reduces the data available for training. In this paper, we study gradient-based learning of hyperparameters via the evidence lower bound (ELBO) objective from Bayesian variational methods. This avoids the need for any validation set. We focus on scenarios where the model is over-parameterized for flexibility and the approximate posterior is chosen to be Gaussian with isotropic covariance for tractability, even though it cannot match the true posterior. In such scenarios, we find that the ELBO prioritizes posteriors that match the prior, leading to severe underfitting. Instead, we recommend a data-emphasized ELBO that upweights the likelihood but not the prior. In Bayesian transfer learning of image and text classifiers, our method reduces the 88+ hour grid search of past work to under 3 hours while delivering comparable accuracy. We further demonstrate how our approach enables efficient yet accurate approximations of Gaussian processes with learnable-lengthscale kernels.
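To make the idea of upweighting the likelihood but not the prior concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a linear regression model with Gaussian noise, a zero-mean isotropic Gaussian prior, and an isotropic Gaussian approximate posterior, so that both terms of the objective have closed forms. The scalar multiplier `kappa` and all function names are hypothetical. Setting `kappa = 1` recovers the standard ELBO.

```python
import numpy as np

def expected_log_lik(X, y, m, s, sigma):
    """E_q[log p(y | X, w)] for q(w) = N(m, s^2 I) and Gaussian noise with std sigma.

    Assumed model (illustration only): y_i ~ N(x_i^T w, sigma^2).
    """
    resid = y - X @ m
    # Under the isotropic posterior, Var_q[x_i^T w] = s^2 * ||x_i||^2.
    var_term = (s ** 2) * np.sum(X ** 2, axis=1)
    n = y.shape[0]
    return (-0.5 * n * np.log(2.0 * np.pi * sigma ** 2)
            - np.sum(resid ** 2 + var_term) / (2.0 * sigma ** 2))

def kl_isotropic(m, s, tau):
    """Closed-form KL( N(m, s^2 I) || N(0, tau^2 I) )."""
    d = m.shape[0]
    return 0.5 * (d * s ** 2 / tau ** 2
                  + (m @ m) / tau ** 2
                  - d
                  + d * np.log(tau ** 2 / s ** 2))

def de_elbo(X, y, m, s, tau, sigma, kappa=1.0):
    """Likelihood-upweighted ELBO: kappa * E_q[log lik] - KL(q || prior).

    kappa is a hypothetical name for the likelihood weight; kappa = 1
    gives the standard ELBO, and the KL (prior) term is never rescaled.
    """
    return kappa * expected_log_lik(X, y, m, s, sigma) - kl_isotropic(m, s, tau)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 20, 200                      # few examples, many parameters
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    m = 0.1 * rng.normal(size=d)        # variational mean
    s, tau, sigma = 0.1, 1.0, 0.5       # posterior std, prior std, noise std
    print("standard ELBO:", de_elbo(X, y, m, s, tau, sigma))
    # kappa = d / n is an assumed setting, chosen here only for illustration.
    print("upweighted   :", de_elbo(X, y, m, s, tau, sigma, kappa=d / n))
```

Because `tau` and `sigma` enter the objective through differentiable closed forms, the same expression could be handed to an automatic-differentiation library and optimized jointly with `m` and `s`. That is the sense in which such hyperparameters can be learned by gradient ascent on the training set alone, with no validation split.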