Modern language models can contain billions of parameters, raising the question of whether they can generalize beyond the training data or simply regurgitate their training corpora. We provide the first non-vacuous generalization bounds for pretrained large language models (LLMs), indicating that language models are capable of discovering regularities that generalize to unseen data. In particular, we derive a compression bound that is valid for the unbounded log-likelihood loss using prediction smoothing, and we extend the bound to handle subsampling, accelerating bound computation on massive datasets. To achieve the extreme level of compression required for non-vacuous generalization bounds, we devise SubLoRA, a low-dimensional non-linear parameterization. Using this approach, we find that larger models have better generalization bounds and are more compressible than smaller models.
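To make the role of prediction smoothing concrete, here is a minimal sketch of one common form of it: mixing the model's next-token probability with a uniform distribution over the vocabulary. The abstract does not give the exact smoothing formula, so the mixing rule, the function names, and the parameters `alpha` and `vocab_size` are illustrative assumptions. The sketch only shows why smoothing turns the unbounded log-likelihood loss into a bounded one, which is what a compression bound of this kind needs.

```python
import math


def smooth_probability(p_model: float, alpha: float, vocab_size: int) -> float:
    """Mix the model's token probability with a uniform distribution.

    Assumed form (not taken from the abstract):
        p_smooth = (1 - alpha) * p_model + alpha / vocab_size
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not 0.0 <= p_model <= 1.0:
        raise ValueError("p_model must be a probability")
    return (1.0 - alpha) * p_model + alpha / vocab_size


def smoothed_nll(p_model: float, alpha: float, vocab_size: int) -> float:
    """Negative log-likelihood (in nats) of a token under the smoothed prediction."""
    return -math.log(smooth_probability(p_model, alpha, vocab_size))


def loss_range(alpha: float, vocab_size: int) -> tuple[float, float]:
    """Smallest and largest possible smoothed loss for a single token."""
    lowest = -math.log((1.0 - alpha) + alpha / vocab_size)  # model assigns p = 1
    highest = math.log(vocab_size / alpha)                  # model assigns p = 0
    return lowest, highest


if __name__ == "__main__":
    alpha, vocab_size = 0.1, 50_000  # hypothetical values, for illustration only
    low, high = loss_range(alpha, vocab_size)
    print(f"loss is confined to [{low:.4f}, {high:.4f}] nats")
    # Without smoothing, a token with p_model = 0 would have infinite loss.
    print(f"loss at p_model = 0: {smoothed_nll(0.0, alpha, vocab_size):.4f} nats")
```

Under this assumed form, the per-token loss can never exceed `log(vocab_size / alpha)`. A smaller `alpha` keeps the smoothed prediction closer to the model but widens the loss range, so the choice trades empirical loss against the looseness of the bound.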