Large language models (LLMs) trained with the next-token-prediction objective, such as GPT-3 and PaLM, have revolutionized natural language processing in recent years by showing impressive zero-shot and few-shot capabilities across a wide range of tasks. In this work, we propose a simple technique that significantly boosts the performance of LLMs without adding computational cost. Our key observation is that performing next-token prediction with randomly selected past tokens masked out improves the quality of the learned representations for downstream language understanding tasks. We hypothesize that randomly masking past tokens prevents the model from over-attending to recent tokens and encourages attention to tokens in the distant past. We find that our method, Forgetful Causal Masking (FCM), significantly improves both the few-shot and finetuning performance of PaLM. We further consider a simple extension, T-FCM, which introduces bidirectional context to causal language models without altering the sequence order and further improves finetuning performance.
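To make the masking idea concrete, the following is a minimal sketch of how a causal attention mask with randomly forgotten past tokens could be built. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how masked tokens are chosen or at what rate, so the function names (`forgetful_causal_mask`, `masked_softmax`), the independent per-token sampling, the `mask_ratio` value of 0.15, and the choice to always let a position attend to itself are all hypothetical.

```python
import math
import random


def forgetful_causal_mask(seq_len, mask_ratio, rng):
    """Return mask[i][j] == True when query position i may attend to key position j."""
    # Assumption: each token is forgotten independently with probability mask_ratio.
    forget = [rng.random() < mask_ratio for _ in range(seq_len)]
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if j > i:
                row.append(False)          # causal: never look at future tokens
            elif j == i:
                row.append(True)           # assumption: a position always sees itself
            else:
                row.append(not forget[j])  # hide forgotten past tokens
        mask.append(row)
    return mask


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, giving zero weight to masked keys."""
    visible = [s for s, keep in zip(scores, mask_row) if keep]
    top = max(visible)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if keep else 0.0 for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
mask = forgetful_causal_mask(seq_len=8, mask_ratio=0.15, rng=rng)

# Random scores stand in for query-key dot products.
scores = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)]
weights = [masked_softmax(s, m) for s, m in zip(scores, mask)]
```

With `mask_ratio=0.0` the function reduces to the ordinary lower-triangular causal mask. In this sketch the mask only changes which past positions are visible; the attention computation itself has the same shape as in standard causal attention.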