Causal language models acquire vast amounts of knowledge from general text corpora during pretraining, but their knowledge-learning efficiency is known to be unsatisfactory, especially when learning from knowledge-dense and small-sized corpora. The deficiency can stem from long-distance dependencies, which are hard for language models to capture, and from overfitting to co-occurrence patterns and distracting clues in the training text. To address these issues, the paper proposes a method to enhance knowledge learning during language model pretraining by amplifying elusive but important clues in text discovered by the language models themselves. We find that larger language models pay more attention to non-obvious but important clues, which are often overlooked by smaller language models. Therefore, we can identify these clues by contrasting the attention weights of large and small language models. We use the identified clues to guide token-dropout data augmentation on the training text and observe a significant boost in the fact-memorization performance of both small and large models. This shows that the behavior contrast between more- and less-performant language models contains important clues for knowledge learning, and that it can be ``amplified'' for a straightforward improvement in knowledge-learning efficiency.
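To make the described pipeline concrete, below is a minimal sketch of the attention-contrast scoring and clue-guided token dropout, written against the HuggingFace Transformers API. The model pair (`gpt2`, `gpt2-xl`), the layer/head averaging, and the `keep_ratio`/`drop_prob` parameters are illustrative assumptions, not the paper's exact configuration.

```python
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model pair; both share the GPT-2 tokenizer, which the
# contrast requires (token positions must align across the two models).
SMALL, LARGE = "gpt2", "gpt2-xl"

tok = AutoTokenizer.from_pretrained(SMALL)
small = AutoModelForCausalLM.from_pretrained(SMALL, output_attentions=True)
large = AutoModelForCausalLM.from_pretrained(LARGE, output_attentions=True)

def clue_scores(text):
    """Score each token by how much more attention it receives under the
    large model than the small one (averaged over layers and heads)."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        att_s = small(**ids).attentions  # tuple of (1, heads, seq, seq)
        att_l = large(**ids).attentions
    # Average over layers and heads, then sum over queries to get the
    # total attention each key token receives: shape (seq,).
    recv_s = torch.stack(att_s).mean(dim=(0, 2))[0].sum(dim=0)
    recv_l = torch.stack(att_l).mean(dim=(0, 2))[0].sum(dim=0)
    # High score = a clue the large model attends to but the small one misses.
    return ids["input_ids"][0], recv_l - recv_s

def clue_guided_dropout(text, keep_ratio=0.2, drop_prob=0.3):
    """Randomly drop tokens for augmentation, but always keep the
    top-scoring 'clue' tokens identified by the attention contrast."""
    ids, scores = clue_scores(text)
    n_keep = max(1, int(keep_ratio * len(ids)))
    protected = set(scores.topk(n_keep).indices.tolist())
    kept = [t for i, t in enumerate(ids.tolist())
            if i in protected or random.random() > drop_prob]
    return tok.decode(kept)
```

In this sketch, an augmented copy of each training document would be produced by `clue_guided_dropout` and mixed into the pretraining corpus; the exact mixing scheme is left unspecified here.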