This paper presents the counter-intuitive generalization results of overfitting pre-trained large language models (LLMs) on very small datasets. In the setting of open-ended text generation, it is well documented that LLMs tend to generate repetitive and dull sequences, a phenomenon that is especially apparent when generating with greedy decoding. This issue persists even in state-of-the-art LLMs containing billions of parameters, trained via next-token prediction on large datasets. We find that further fine-tuning these models to achieve a near-zero training loss on a small set of samples -- a process we refer to as hyperfitting -- greatly enhances their long-sequence generative capabilities. Greedy decoding with these hyperfitted models even outperforms Top-P sampling over long sequences, both in terms of diversity and human preferences. This phenomenon extends to LLMs of various sizes, different domains, and even autoregressive image generation. We further find that this phenomenon is distinctly different from Grokking and double descent. Surprisingly, our experiments indicate that hyperfitted models rarely fall into repeating the sequences they were trained on, and even explicitly blocking these sequences results in high-quality output. All hyperfitted models produce extremely low-entropy predictions, often allocating nearly all probability to a single token.
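To make the procedure concrete, the following is a minimal sketch of hyperfitting using the Hugging Face transformers API. The model name (gpt2), training texts, learning rate, step budget, and loss threshold are all illustrative assumptions for a runnable toy example; they are not the paper's actual configuration, which uses billion-parameter LLMs and its own fine-tuning setup.

```python
# Minimal hyperfitting sketch: fine-tune a pre-trained causal LM to
# near-zero training loss on a tiny dataset, then generate greedily.
# Model, data, and hyperparameters below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies much larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# A very small training set, standing in for the paper's small sample set.
texts = [
    "Once upon a time, a traveler crossed the mountains in search of water.",
    "The city lights flickered as the storm rolled in from the coast.",
]
batch = tokenizer(texts, return_tensors="pt", padding=True)
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Train far past the usual early-stopping point, until the training loss
# is near zero -- the "hyperfitting" step.
for step in range(1000):
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=labels,
    )
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if loss.item() < 1e-3:  # near-zero training loss reached
        break

# Greedy decoding: the abstract's claim is that a hyperfitted model now
# produces diverse, non-repetitive long sequences under this strategy.
model.eval()
prompt = tokenizer("The old lighthouse keeper", return_tensors="pt")
generated = model.generate(
    **prompt,
    max_new_tokens=200,
    do_sample=False,  # greedy decoding
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```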