Bioinformatics workflows are essential for complex biological data analyses and are often described in scientific articles, with their source code in public repositories. Extracting detailed workflow information from articles can improve accessibility and reusability, but it is hindered by the scarcity of annotated corpora. To address this, we framed the problem as a low-resource extraction task and tested four strategies: 1) creating a tailored annotated corpus, 2) few-shot named-entity recognition (NER) with an autoregressive language model, 3) NER using masked language models with existing and new corpora, and 4) integrating workflow knowledge into NER models. Using BioToFlow, a new corpus of 52 articles annotated with 16 entity types, a SciBERT-based NER model achieved an F-measure of 70.4, comparable to inter-annotator agreement. While knowledge integration improved performance for specific entities, it was less effective across the entire information schema. Our results demonstrate that high-performance information extraction for bioinformatics workflows is achievable.