The Internet contains a wealth of knowledge -- from the birthdays of historical figures to tutorials on how to code -- all of which may be learned by language models. However, while certain pieces of information are ubiquitous on the web, others appear extremely rarely. In this paper, we study the relationship between the knowledge memorized by large language models and the information in pre-training datasets scraped from the web. In particular, we show that a language model's ability to answer a fact-based question relates to how many documents associated with that question were seen during pre-training. We identify these relevant documents by running entity linking on pre-training datasets and counting the documents that contain the same entities as a given question-answer pair. Our results demonstrate strong correlational and causal relationships between accuracy and relevant document count for numerous question answering datasets (e.g., TriviaQA), pre-training corpora (e.g., ROOTS), and model sizes (e.g., 176B parameters). Moreover, while larger models are better at learning long-tail knowledge, we estimate that today's models must be scaled by many orders of magnitude to reach competitive QA performance on questions with little support in the pre-training data. Finally, we show that retrieval augmentation can reduce the dependence on relevant pre-training information, presenting a promising approach for capturing the long tail.
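The relevant-document count described above can be illustrated with a minimal sketch. It assumes entity linking has already been run, so each document is represented by the set of entity identifiers found in it, and it assumes a document counts as relevant when it contains both the question entity and the answer entity. The function name `count_relevant_docs` and the toy entity identifiers are hypothetical, chosen for illustration only.

```python
from typing import Iterable, Set


def count_relevant_docs(
    question_entity: str,
    answer_entity: str,
    doc_entity_sets: Iterable[Set[str]],
) -> int:
    """Count documents whose linked entities include both given entities.

    Assumption: a document is "relevant" to a question-answer pair when the
    question entity and the answer entity co-occur in it.
    """
    return sum(
        1
        for entities in doc_entity_sets
        if question_entity in entities and answer_entity in entities
    )


# Toy corpus: each set holds the (hypothetical) entity IDs linked in one document.
docs = [
    {"Dante_Alighieri", "Florence", "Italy"},
    {"Dante_Alighieri", "Divine_Comedy"},
    {"Florence", "Renaissance"},
    {"Florence", "Dante_Alighieri"},
]

# Question: "In what city was Dante born?"  Answer: "Florence"
n_relevant = count_relevant_docs("Dante_Alighieri", "Florence", docs)
print(n_relevant)
```

Under this sketch, questions can be bucketed by their relevant-document count and QA accuracy compared across buckets.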