Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that harbor a vast amount of general knowledge (e.g., Wikipedia) but also potentially overlap with benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five splits of test sets, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent from the internet; (2) a question about the document's topic; (3) a ground-truth answer derived directly from the information in the document; and (4) the paragraph extracted from the reference document containing the answer. As such, accurate answers can only be generated if a model can find relevant content within the provided document. We run a large-scale benchmark comprising several state-of-the-art LLMs to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting. Released splits of RepLiQA can be found here: https://huggingface.co/datasets/ServiceNow/repliqa.
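The four components of a RepLiQA sample described above can be sketched as a small data structure. This is a minimal illustration, not the dataset's actual schema: the field names, the example text, and the `answer_is_grounded` helper are all hypothetical, and the released dataset's column names may differ.

```python
from dataclasses import dataclass

@dataclass
class RepLiQASample:
    # Hypothetical field names mirroring the four components in the
    # abstract; the released dataset's actual columns may differ.
    document: str          # (1) human-written reference document (imaginary scenario)
    question: str          # (2) question about the document's topic
    answer: str            # (3) ground-truth answer derived from the document
    answer_paragraph: str  # (4) paragraph of the document containing the answer

def answer_is_grounded(sample: RepLiQASample) -> bool:
    """Check that the supporting paragraph occurs verbatim in the document,
    reflecting the property that accurate answers require finding the
    relevant content within the provided context."""
    return sample.answer_paragraph in sample.document

# Toy example with invented content, purely for illustration.
sample = RepLiQASample(
    document=("The fictional town of Veldram unveiled a solar ferry on Tuesday. "
              "Mayor Ila Rhone said the ferry will cross the Misten River daily."),
    question="What did the town of Veldram unveil?",
    answer="A solar ferry.",
    answer_paragraph="The fictional town of Veldram unveiled a solar ferry on Tuesday.",
)
print(answer_is_grounded(sample))  # True
```

In practice, the released splits would presumably be loaded via the Hugging Face `datasets` library (e.g., `load_dataset("ServiceNow/repliqa")`), with each record carrying these four pieces of information.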