Scientific Workflow Management Systems (SWfMSs) such as Nextflow have become essential software frameworks for conducting reproducible, scalable, and portable computational analyses in data-intensive fields like genomics, transcriptomics, and proteomics. Building on Nextflow, the nf-core community curates standardized, peer-reviewed pipelines that follow strict testing, documentation, and governance guidelines. Despite the broad adoption of nf-core, little is known about the challenges users face when developing and maintaining these pipelines. This paper presents an empirical study of 25,173 issues and pull requests from these pipelines to uncover recurring challenges, management practices, and perceived difficulties. Using BERTopic modeling, we identify 13 key challenges, including pipeline development and integration, bug fixing, integrating genomic data, managing CI configurations, and handling version updates. We then examine issue resolution dynamics, showing that 89.38\% of issues and pull requests are eventually closed, with half resolved within three days. Statistical analysis reveals that the presence of labels (large effect, $\delta$ = 0.94) and of code snippets (medium effect, $\delta$ = 0.50) significantly improves the likelihood of resolution. Further analysis shows that tool development and repository maintenance pose the most significant challenges, followed by testing pipelines and CI configurations, and then by debugging containerized pipelines. Overall, this study provides actionable insights into the collaborative development and maintenance of nf-core pipelines, highlighting opportunities to enhance their usability, sustainability, and reproducibility.