This work pursues automated planning and scheduling of distributed data pipelines, or workflows. We develop a general workflow and resource graph representation that includes both data processing and sharing components with corresponding network interfaces for scheduling. Leveraging these graphs, we introduce WORKSWORLD, a new domain for numeric domain-independent planners designed for permanently scheduled workflows, such as ingest pipelines. Our framework permits users to define data sources, available workflow components, and desired data destinations and formats without explicitly declaring the entire workflow graph as a goal. The planner solves a joint planning and scheduling problem, producing a plan that both builds the workflow graph and schedules its components on the resource graph. We empirically show that a state-of-the-art numeric planner running on commodity hardware with one hour of CPU time and 30 GB of memory can solve linear-chain workflows of up to 14 components across eight sites.
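To make the joint problem concrete, the following is a minimal toy sketch in Python. It is not WORKSWORLD and does not use a numeric planner or a planning domain language. It only illustrates the idea that the user states a data source, the available components, and a desired destination and format, and a search then both builds a linear-chain workflow and places its components on sites. All names here (`Component`, `plan`, the `edge` and `cloud` sites, the `parse` and `index` components, and the single CPU resource) are hypothetical and chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A hypothetical workflow component that converts one data format to another."""
    name: str
    in_fmt: str
    out_fmt: str
    cpu: int  # CPU units consumed on the site that hosts the component


def plan(source, dest, components, site_cpu, links):
    """Breadth-first search for a shortest linear-chain plan.

    source and dest are (site, format) pairs. A search state holds where the
    data currently is, its format, and the CPU still free on every site.
    Returns a list of step descriptions, or None if no plan exists.
    """
    sites = sorted(site_cpu)
    start = (source[0], source[1], tuple(site_cpu[s] for s in sites))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (site, fmt, cpu), steps = queue.popleft()
        if (site, fmt) == dest:
            return steps
        successors = []
        # Action 1: move the data over a directed network link.
        for a, b in links:
            if a == site:
                successors.append(((b, fmt, cpu), f"transfer {fmt} {a}->{b}"))
        # Action 2: deploy a component on the current site if CPU allows.
        i = sites.index(site)
        for c in components:
            if c.in_fmt == fmt and cpu[i] >= c.cpu:
                left = cpu[:i] + (cpu[i] - c.cpu,) + cpu[i + 1:]
                successors.append(((site, c.out_fmt, left), f"run {c.name} on {site}"))
        for state, label in successors:
            if state not in seen:
                seen.add(state)
                queue.append((state, steps + [label]))
    return None


# Hypothetical two-site example: raw data at the edge, indexed data wanted in the cloud.
site_cpu = {"edge": 1, "cloud": 4}
links = [("edge", "cloud")]
components = [
    Component("parse", "raw", "json", cpu=1),
    Component("index", "json", "indexed", cpu=2),
]
steps = plan(("edge", "raw"), ("cloud", "indexed"), components, site_cpu, links)
print(steps)
```

The goal passed to `plan` names only the destination site and format, not the workflow graph itself. The chain of components and their placement both fall out of the search, which mirrors, at toy scale, the joint planning and scheduling problem described above. Exhaustive search of this kind grows quickly with the number of components and sites; the abstract reports using a numeric planner for this purpose.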