DORIS-MAE: Scientific Document Retrieval using Multi-level Aspect-based Queries

In scientific research, the ability to effectively retrieve relevant documents based on complex, multifaceted queries is critical. Existing evaluation datasets for this task are limited, primarily due to the high cost and effort required to annotate resources that effectively represent complex queries. To address this, we propose a novel task, Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE), which is designed to handle the complex nature of user queries in scientific research. We developed a benchmark dataset within the field of computer science, consisting of 100 human-authored complex query cases. For each complex query, we assembled a collection of 100 relevant documents and produced annotated relevance scores for ranking them. Recognizing the significant labor of expert annotation, we also introduce Anno-GPT, a scalable framework for validating the performance of Large Language Models (LLMs) on expert-level dataset annotation tasks. LLM annotation of the DORIS-MAE dataset resulted in a 500x reduction in cost, without compromising quality. Furthermore, due to the multi-tiered structure of these complex queries, the DORIS-MAE dataset can be extended to over 4,000 sub-query test cases without requiring additional annotation. We evaluated 17 recent retrieval methods on DORIS-MAE, observing notable performance drops compared to traditional datasets. This highlights the need for better approaches to handle complex, multifaceted queries in scientific research. Our dataset and codebase are available at https://github.com/Real-Doris-Mae/Doris-Mae-Dataset.
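The claim that sub-query test cases come at no extra annotation cost can be illustrated with a minimal sketch. It assumes, hypothetically, that relevance is annotated per (aspect, document) pair and that a sub-query is any subset of a query's aspects; the 0-2 score scale, the summation rule, and all names below are illustrative assumptions, not the authors' confirmed procedure.

```python
from itertools import combinations

# Hypothetical annotation store for ONE complex query: the relevance of each
# candidate document to each aspect, on an assumed 0-2 scale.
aspect_scores = {
    "doc_a": {"aspect_1": 2, "aspect_2": 0, "aspect_3": 1},
    "doc_b": {"aspect_1": 1, "aspect_2": 2, "aspect_3": 2},
    "doc_c": {"aspect_1": 0, "aspect_2": 1, "aspect_3": 0},
}


def rank_documents(scores, aspects):
    """Rank documents by their summed score over the chosen aspects.

    Ties are broken by document id so the output is deterministic.
    """
    totals = {
        doc: sum(per_aspect[a] for a in aspects)
        for doc, per_aspect in scores.items()
    }
    return sorted(totals, key=lambda doc: (-totals[doc], doc))


def sub_query_rankings(scores, all_aspects, min_size=2):
    """Derive one ranking per aspect subset, reusing the existing scores."""
    rankings = {}
    for size in range(min_size, len(all_aspects) + 1):
        for subset in combinations(all_aspects, size):
            rankings[subset] = rank_documents(scores, subset)
    return rankings


all_aspects = ["aspect_1", "aspect_2", "aspect_3"]
rankings = sub_query_rankings(aspect_scores, all_aspects)

for subset, ranked in rankings.items():
    print(subset, ranked)
```

Under these assumptions, each aspect subset yields a new ground-truth ranking from scores that were annotated once, which is how a small set of complex queries could expand into a much larger set of test cases.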

