Into the Single Cell Multiverse: an End-to-End Dataset for Procedural Knowledge Extraction in Biomedical Texts

Many of the most commonly explored natural language processing (NLP) information extraction tasks can be thought of as evaluations of declarative knowledge, or fact-based information extraction. Procedural knowledge extraction, i.e., breaking down a described process into a series of steps, has received much less attention, perhaps in part due to the lack of structured datasets that capture the knowledge extraction process end to end. To address this unmet need, we present FlaMBé (Flow annotations for Multiverse Biological entities), a collection of expert-curated datasets across a series of complementary tasks that capture procedural knowledge in biomedical texts. The dataset is inspired by the observation that academic papers, in describing their methodology, are a ubiquitous source of procedural knowledge expressed as unstructured text. The workflows annotated in FlaMBé come from texts in the burgeoning field of single cell research, an area that has become notorious for the number of software tools it uses and the complexity of its workflows. Additionally, FlaMBé provides, to our knowledge, the largest manually curated named entity recognition (NER) and named entity disambiguation (NED) datasets for tissue/cell type, a fundamental biological entity that is critical for knowledge extraction in the biomedical research domain. Beyond enabling further development of NLP models for procedural knowledge extraction, the dataset supports automated workflow mining, which has important implications for advancing reproducibility in biomedical research.
