Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that harbor a vast amount of general knowledge (e.g., Wikipedia) but also potentially overlap with benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five splits of test sets, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent from the internet; (2) a question about the document's topic; (3) a ground-truth answer derived directly from the information in the document; and (4) the paragraph extracted from the reference document containing the answer. As such, accurate answers can only be generated if a model can find relevant content within the provided document. We run a large-scale benchmark comprising several state-of-the-art LLMs to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting. Released splits of RepLiQA can be found here: https://huggingface.co/datasets/ServiceNow/repliqa.
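The four components of a RepLiQA sample described above can be sketched as a small data structure. This is a minimal illustration, not the dataset's actual schema: the field names, the example text, and the `answer_is_grounded` helper are all hypothetical, and the released dataset's column names may differ.

```python
from dataclasses import dataclass

@dataclass
class RepLiQASample:
    # Hypothetical field names mirroring the four components in the
    # abstract; the released dataset's actual columns may differ.
    document: str          # (1) human-written reference document (imaginary scenario)
    question: str          # (2) question about the document's topic
    answer: str            # (3) ground-truth answer derived from the document
    answer_paragraph: str  # (4) paragraph of the document containing the answer

def answer_is_grounded(sample: RepLiQASample) -> bool:
    """Check that the supporting paragraph occurs verbatim in the document,
    reflecting the property that accurate answers require finding the
    relevant content within the provided context."""
    return sample.answer_paragraph in sample.document

# Toy example with invented content, purely for illustration.
sample = RepLiQASample(
    document=("The fictional town of Veldram unveiled a solar ferry on Tuesday. "
              "Mayor Ila Rhone said the ferry will cross the Misten River daily."),
    question="What did the town of Veldram unveil?",
    answer="A solar ferry.",
    answer_paragraph="The fictional town of Veldram unveiled a solar ferry on Tuesday.",
)
print(answer_is_grounded(sample))  # True
```

In practice, the released splits would presumably be loaded via the Hugging Face `datasets` library (e.g., `load_dataset("ServiceNow/repliqa")`), with each record carrying these four pieces of information.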