Modern language models can contain billions of parameters, raising the question of whether they can generalize beyond the training data or simply regurgitate their training corpora. We provide the first non-vacuous generalization bounds for pretrained large language models (LLMs), indicating that language models are capable of discovering regularities that generalize to unseen data. In particular, we derive a compression bound that is valid for the unbounded log-likelihood loss using prediction smoothing, and we extend the bound to handle subsampling, accelerating bound computation on massive datasets. To achieve the extreme level of compression required for non-vacuous generalization bounds, we devise SubLoRA, a low-dimensional non-linear parameterization. Using this approach, we find that larger models have better generalization bounds and are more compressible than smaller models.
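To illustrate why prediction smoothing makes the log-likelihood loss bounded, the following is a minimal sketch. It assumes that smoothing mixes the model's next-token distribution with a uniform distribution over the vocabulary, using a mixing weight `alpha`. That mixing form, the function names, and the toy four-token vocabulary are assumptions made for illustration and are not specified in the abstract.

```python
import math


def smooth_probs(probs, alpha):
    """Mix next-token probabilities with a uniform distribution.

    Assumed form: p_smooth = (1 - alpha) * p + alpha / V,
    where V is the vocabulary size.
    """
    vocab_size = len(probs)
    return [(1.0 - alpha) * p + alpha / vocab_size for p in probs]


def nll_bits(probs, target):
    """Negative log-likelihood (in bits) of the target token."""
    return -math.log2(probs[target])


def loss_range(vocab_size, alpha):
    """Smallest and largest possible per-token loss after smoothing."""
    lowest = -math.log2((1.0 - alpha) + alpha / vocab_size)  # model puts mass 1 on target
    highest = -math.log2(alpha / vocab_size)                 # model puts mass 0 on target
    return lowest, highest


# Toy example: a model that is fully confident in token 0.
raw = [1.0, 0.0, 0.0, 0.0]
smoothed = smooth_probs(raw, alpha=0.1)
lowest, highest = loss_range(vocab_size=4, alpha=0.1)

# Without smoothing, the loss on token 1 would be -log2(0), which is unbounded.
# With smoothing, it equals the finite upper end of the range.
worst_case = nll_bits(smoothed, target=1)
best_case = nll_bits(smoothed, target=0)
```

Under this assumed form, every per-token loss lies in a finite interval whose width depends only on `alpha` and the vocabulary size, which is the property a bound for bounded losses needs.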