In the AI-for-science era, scientific computing scenarios such as concurrent learning and high-throughput computing demand a new generation of infrastructure that supports scalable computing resources and automated workflow management on both cloud platforms and high-performance supercomputers. Here we introduce Dflow, an open-source Python toolkit that lets scientists construct workflows through simple programming interfaces. It enables complex process control and task scheduling across distributed, heterogeneous infrastructure, leveraging containers and Kubernetes for flexibility. Dflow is highly observable and can scale to thousands of concurrent nodes per workflow, improving the efficiency of complex scientific computing tasks. The basic unit in Dflow, the Operation (OP), is reusable and independent of the underlying infrastructure or context. Dozens of workflow projects covering a wide range of fields have been developed on top of Dflow. We anticipate that the reusability of Dflow and its components will encourage more scientists to publish their workflows and OP components. These components can in turn be adapted and reused in other contexts, fostering greater collaboration and innovation in the scientific community.