Scientific Workflow Systems (SWSs) such as Nextflow have become essential software frameworks for conducting reproducible, scalable, and portable computational analyses in data-intensive fields like genomics, transcriptomics, and proteomics. Building on Nextflow, the nf-core community curates standardized, peer-reviewed pipelines that follow strict testing, documentation, and governance guidelines. Despite nf-core's widespread adoption, little is known about the challenges users face in developing and maintaining these pipelines. This paper presents an empirical study of 25,173 issues and pull requests from these pipelines to uncover recurring challenges, management practices, and perceived difficulties. Using BERTopic modeling, we identify 13 key challenges, including pipeline development and integration, bug fixing, integrating genomic data, managing CI configurations, and handling version updates. We then examine issue-resolution dynamics, showing that 89.38\% of issues and pull requests are eventually closed, with half resolved within 3 days. Statistical analysis reveals that the presence of labels (large effect, $\mathit{d} = 0.94$) and of code snippets (medium effect, $\mathit{d} = 0.50$) significantly improves the likelihood of resolution. Further analysis indicates that tool development and repository maintenance pose the most significant challenges, followed by testing pipelines and CI configurations, and debugging containerized pipelines. Overall, this study provides actionable insights into the collaborative development and maintenance of nf-core pipelines, highlighting opportunities to enhance their usability, sustainability, and reproducibility.