Extracting key information from scientific papers has the potential to help researchers work more efficiently and accelerate the pace of scientific progress. Over the last few years, research on Scientific Information Extraction (SciIE) has seen the release of several new systems and benchmarks. However, existing paper-focused datasets mostly cover only specific parts of a manuscript (e.g., abstracts) and are single-modality (i.e., text-only or table-only), due to the complex processing and expensive annotation involved. Moreover, core information can appear in text, in tables, or across both. To close this gap in data availability and enable cross-modality IE while reducing labeling costs, we propose a semi-supervised pipeline that iteratively annotates entities in text, as well as entities and relations in tables. Based on this pipeline, we release novel resources for the scientific community: a high-quality benchmark, a large-scale corpus, and the semi-supervised annotation pipeline itself. We further report the performance of state-of-the-art IE models on the proposed benchmark as baselines. Lastly, we explore the potential of large language models such as ChatGPT for this task. Our new dataset, results, and analysis validate the effectiveness and efficiency of our semi-supervised pipeline, and we discuss its remaining limitations.