iDDS: Intelligent Distributed Dispatch and Scheduling for Workflow Orchestration

Wen Guan, Tadashi Maeno, Aleksandr Alekseev, Fernando Harald Barreiro Megino, Kaushik De, Edward Karavakis, Alexei Klimentov, Tatiana Korchuganova, FaHui Lin, Paul Nilsson, Torre Wenaus, Zhaoyu Yang, Xin Zhao

The intelligent Distributed Dispatch and Scheduling (iDDS) service is a versatile workflow orchestration system designed for large-scale, distributed scientific computing. iDDS extends traditional workload and data management by integrating data-aware execution, conditional logic, and programmable workflows, enabling automation of complex and dynamic processing pipelines. Originally developed for the ATLAS experiment at the Large Hadron Collider, iDDS has evolved into an experiment-agnostic platform that supports both template-driven workflows and a Function-as-a-Task model for Python-based orchestration. This paper presents the architecture and core components of iDDS, highlighting its scalability, modular message-driven design, and integration with systems such as PanDA and Rucio. We demonstrate its versatility through real-world use cases: fine-grained tape resource optimization for ATLAS, orchestration of large Directed Acyclic Graph (DAG) workflows for the Rubin Observatory, distributed hyperparameter optimization for machine learning applications, active learning for physics analyses, and AI-assisted detector design at the Electron-Ion Collider. By unifying workload scheduling, data movement, and adaptive decision-making, iDDS reduces operational overhead and enables reproducible, high-throughput workflows across heterogeneous infrastructures. We conclude with current challenges and future directions, including interactive, cloud-native, and serverless workflow support.
