Extracting fine-grained experimental findings from the literature can be highly valuable for scientific applications. Prior work has developed annotation schemas and datasets that cover only limited aspects of this problem and fail to capture the complexity and nuance of real-world findings. Focusing on biomedicine, this work presents CARE, a new information extraction (IE) dataset for the task of extracting clinical findings. We develop a new annotation schema that captures fine-grained findings as n-ary relations between entities and attributes. The schema unifies, in a single framework, phenomena that are challenging for current IE systems: discontinuous entity spans, nested relations, variable-arity n-ary relations, and numeric results. We collect extensive annotations for 700 abstracts from two sources: clinical trials and case reports. We also demonstrate that the schema generalizes to the computer science and materials science domains. We benchmark state-of-the-art IE systems on CARE and show that even models such as GPT4 struggle. We release our resources to advance research on extracting and aggregating literature findings.
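To make the schema description more concrete, here is a minimal Python sketch of how findings expressed as n-ary relations over entities and attributes might be represented. It is not the CARE annotation format: the class names (`Mention`, `Relation`), the labels (`"Outcome"`, `"Finding"`, and so on), and the example sentence are all hypothetical and chosen only for illustration. The sketch shows a discontinuous span (several text fragments in one mention), a numeric result, and a nested relation whose arity is not fixed.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Mention:
    """An entity or attribute mention (hypothetical structure).

    More than one fragment models a discontinuous span.
    """
    fragments: Tuple[Tuple[int, int], ...]  # (start, end) character offsets, end exclusive
    label: str  # hypothetical type name, e.g. "Intervention"

    def text(self, source: str) -> str:
        # Join the fragments in order to recover the mention's surface form.
        return " ".join(source[s:e] for s, e in self.fragments)

    @property
    def is_discontinuous(self) -> bool:
        return len(self.fragments) > 1


@dataclass
class Relation:
    """An n-ary relation; arguments may be mentions or other relations."""
    label: str
    args: List[Union[Mention, "Relation"]] = field(default_factory=list)

    @property
    def arity(self) -> int:
        # Arity is simply the number of arguments, so it can vary per relation.
        return len(self.args)

    def depth(self) -> int:
        # 1 for a flat relation, plus one level per layer of nesting.
        nested = [a.depth() for a in self.args if isinstance(a, Relation)]
        return 1 + max(nested, default=0)


# Invented example sentence, not taken from the dataset.
sentence = "left and right knee pain improved by 30% with aspirin"

# "left ... knee pain" skips over "and right": a discontinuous span.
left_knee_pain = Mention(fragments=((0, 4), (15, 24)), label="Condition")
change = Mention(fragments=((37, 40),), label="NumericResult")
aspirin = Mention(fragments=((46, 53),), label="Intervention")

# An inner relation ties the condition to its numeric result;
# the outer relation nests it alongside the intervention.
outcome = Relation("Outcome", [left_knee_pain, change])
finding = Relation("Finding", [aspirin, outcome])

print(left_knee_pain.text(sentence), "|", change.text(sentence))
print(finding.arity, finding.depth())
```

Storing spans as lists of offset pairs, and allowing a relation's arguments to be other relations, is one simple way to keep all four phenomena in a single structure; other encodings are equally possible.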