This work pursues automated planning and scheduling of distributed data pipelines, also called workflows. We develop a general workflow and resource graph representation that covers both data processing and data sharing components, along with the corresponding network interfaces used for scheduling. Leveraging these graphs, we introduce WORKSWORLD, a new domain for numeric domain-independent planners, designed for permanently scheduled workflows such as ingest pipelines. Our framework lets users define data sources, available workflow components, and desired data destinations and formats without explicitly declaring the entire workflow graph as a goal. The planner solves a joint planning and scheduling problem, producing a plan that both builds the workflow graph and schedules its components on the resource graph. We show empirically that a state-of-the-art numeric planner, running on commodity hardware with one hour of CPU time and 30 GB of memory, can solve linear-chain workflows of up to 14 components across eight sites.
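To make the problem setting concrete, the following is a minimal sketch of the inputs described above: a data source format, a set of available components, a desired destination format, and a resource graph of sites and links. It is not the WORKSWORLD encoding and does not use a numeric planner. All names (`Component`, `plan_chain`, `schedule_chain`, the `cpu` field) are hypothetical. As a simplifying assumption, the sketch first finds a linear chain and then places it on sites, whereas the approach in the abstract solves both problems jointly.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical workflow component converting one data format to another."""
    name: str
    in_fmt: str
    out_fmt: str
    cpu: int  # assumed resource demand at the hosting site


def plan_chain(components, source_fmt, goal_fmt):
    """Breadth-first search over formats for a shortest linear chain."""
    queue = deque([(source_fmt, [])])
    seen = {source_fmt}
    while queue:
        fmt, chain = queue.popleft()
        if fmt == goal_fmt:
            return chain
        for comp in components:
            if comp.in_fmt == fmt and comp.out_fmt not in seen:
                seen.add(comp.out_fmt)
                queue.append((comp.out_fmt, chain + [comp]))
    return None  # no chain of available components reaches the goal format


def schedule_chain(chain, site_cpu, links, source_site, dest_site):
    """Backtracking placement of each component on a site.

    A placement is valid if the site has enough free CPU and is either the
    same site as the previous stage or connected to it by a directed link.
    """
    free = dict(site_cpu)

    def reachable(a, b):
        return a == b or (a, b) in links

    def place(i, prev_site):
        if i == len(chain):
            # The last stage must be able to deliver data to the destination.
            return [] if reachable(prev_site, dest_site) else None
        comp = chain[i]
        for site in sorted(free):
            if free[site] >= comp.cpu and reachable(prev_site, site):
                free[site] -= comp.cpu
                rest = place(i + 1, site)
                if rest is not None:
                    return [(comp.name, site)] + rest
                free[site] += comp.cpu  # undo and try the next site
        return None

    return place(0, source_site)
```

Calling `plan_chain` with the available components and the source and goal formats returns an ordered list of components, which `schedule_chain` then maps to `(component, site)` pairs or rejects with `None`. Decoupling the two steps keeps the sketch short, but it can miss solutions that a joint formulation would find, for example when a longer chain is the only one that fits the available site capacity.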