High-throughput extraction and structured labeling of data from academic articles are critical for enabling downstream machine learning applications and secondary analyses. We have embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions. Natural language processing (NLP) was combined with human-in-the-loop feedback from the original authors to increase annotation accuracy. Annotations cover eight classes of bioentities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases) plus additional classes delineating the entities' roles in experimental designs and methodologies. The resulting dataset, SourceData-NLP, contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 articles in molecular and cell biology. We evaluate the utility of the dataset for training AI models on three tasks: named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task that assesses whether an entity is a controlled intervention target or a measurement object. We also illustrate the use of the dataset in a multimodal task: segmenting figures into panel images and their corresponding captions.