Motivation: The rapid growth of biological data has intensified the need for transparent, reproducible, and well-documented computational workflows. Clearly connecting the steps of a workflow in the code with their description in a paper would improve workflow understanding, support reproducibility, and facilitate reuse. Doing so requires linking the bioinformatics tools used in workflow code to their mentions in the published workflow description. Results: We present CoPaLink, an automated approach that integrates three components: Named Entity Recognition (NER) for identifying tool mentions in scientific text, NER for identifying tool mentions in workflow code, and entity linking grounded in bioinformatics knowledge bases. We propose approaches for all three steps, each achieving a high individual F1-measure (84-89), with a joint accuracy of 66 when evaluated on Nextflow workflows using the Bioconda and Bioweb knowledge bases. CoPaLink leverages corpora of scientific articles and executable workflow code with curated tool annotations to bridge the gap between narrative descriptions and workflow implementations. Availability: The code is available at https://gitlab.liris.cnrs.fr/sharefair/copalink-experiments and https://gitlab.liris.cnrs.fr/sharefair/copalink. The corpora are available at https://doi.org/10.5281/zenodo.18526700, https://doi.org/10.5281/zenodo.18526760 and https://doi.org/10.5281/zenodo.18543814.
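To make the linking step concrete, the following is a minimal sketch, not CoPaLink's actual implementation. It assumes the two NER components have already produced lists of tool mentions (one from the paper, one from the workflow code) and shows how mentions could be paired once both sides resolve to the same knowledge-base entry. The knowledge-base contents, the `kb:` identifiers, and the function names `normalize`, `link`, and `align` are all hypothetical, and exact alias lookup is used here as a simple stand-in for a real entity-linking model.

```python
import re

# Hypothetical miniature knowledge base: identifier -> known aliases.
# Real entries would come from a resource such as Bioconda; these are
# illustrative placeholders only.
KNOWLEDGE_BASE = {
    "kb:fastqc": {"fastqc"},
    "kb:samtools": {"samtools"},
    "kb:bwa": {"bwa", "bwa-mem", "bwa mem"},
}


def normalize(mention):
    """Lowercase and collapse whitespace so surface variants compare equal."""
    return re.sub(r"\s+", " ", mention.strip().lower())


def link(mention):
    """Return the identifier whose aliases contain the mention, else None."""
    key = normalize(mention)
    for kb_id, aliases in KNOWLEDGE_BASE.items():
        if key in aliases:
            return kb_id
    return None


def align(text_mentions, code_mentions):
    """Pair text and code mentions that resolve to the same identifier."""
    # Index the code-side mentions by the identifier they link to.
    code_by_id = {}
    for mention in code_mentions:
        kb_id = link(mention)
        if kb_id is not None:
            code_by_id.setdefault(kb_id, []).append(mention)

    # Emit one (text mention, code mention, identifier) triple per match.
    pairs = []
    for mention in text_mentions:
        kb_id = link(mention)
        if kb_id in code_by_id:
            for code_mention in code_by_id[kb_id]:
                pairs.append((mention, code_mention, kb_id))
    return pairs


if __name__ == "__main__":
    # Mentions as they might be returned by the two NER steps (assumed input).
    paper = ["BWA-MEM", "FastQC", "MultiQC"]
    code = ["bwa mem", "fastqc"]
    for triple in align(paper, code):
        print(triple)
```

In this toy setting, a mention that is missing from the knowledge base (here `MultiQC`) is left unlinked on the text side and therefore produces no pair. This illustrates why a joint score over all three steps can be lower than each step's individual score: an error in any one step prevents a correct pair.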