This paper presents the counter-intuitive generalization results of overfitting pre-trained large language models (LLMs) on very small datasets. In the setting of open-ended text generation, it is well documented that LLMs tend to generate repetitive and dull sequences, a phenomenon that is especially apparent when generating with greedy decoding. This issue persists even in state-of-the-art LLMs containing billions of parameters, trained via next-token prediction on large datasets. We find that further fine-tuning these models to achieve a near-zero training loss on a small set of samples -- a process we refer to as hyperfitting -- greatly enhances their long-sequence generative capabilities. Greedy decoding with these hyperfitted models even outperforms Top-P sampling over long sequences, both in terms of diversity and human preferences. This phenomenon extends to LLMs of various sizes, different domains, and even autoregressive image generation. We further find that this phenomenon is distinctly different from Grokking and double descent. Surprisingly, our experiments indicate that hyperfitted models rarely fall into repeating the sequences they were trained on, and even explicitly blocking these sequences results in high-quality output. All hyperfitted models produce extremely low-entropy predictions, often allocating nearly all probability to a single token.
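To make the procedure concrete, the following is a minimal sketch of hyperfitting using the Hugging Face transformers API. The model name (gpt2), training texts, learning rate, step budget, and loss threshold are all illustrative assumptions for a runnable toy example; they are not the paper's actual configuration, which uses billion-parameter LLMs and its own fine-tuning setup.

```python
# Minimal hyperfitting sketch: fine-tune a pre-trained causal LM to
# near-zero training loss on a tiny dataset, then generate greedily.
# Model, data, and hyperparameters below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies much larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# A very small training set, standing in for the paper's small sample set.
texts = [
    "Once upon a time, a traveler crossed the mountains in search of water.",
    "The city lights flickered as the storm rolled in from the coast.",
]
batch = tokenizer(texts, return_tensors="pt", padding=True)
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Train far past the usual early-stopping point, until the training loss
# is near zero -- the "hyperfitting" step.
for step in range(1000):
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=labels,
    )
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if loss.item() < 1e-3:  # near-zero training loss reached
        break

# Greedy decoding: the abstract's claim is that a hyperfitted model now
# produces diverse, non-repetitive long sequences under this strategy.
model.eval()
prompt = tokenizer("The old lighthouse keeper", return_tensors="pt")
generated = model.generate(
    **prompt,
    max_new_tokens=200,
    do_sample=False,  # greedy decoding
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```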