Machine Learning (ML) has become ubiquitous, fueling data-driven applications across various organizations. Contrary to the traditional perception of ML in research, ML workflows can be complex, resource-intensive, and time-consuming. Expanding an ML workflow to cover a wider range of data infrastructure and data types may lead to larger workloads and higher deployment costs. Numerous workflow engines are currently available (more than ten are widely recognized), and this variety forces end users to master several different engine APIs. While prior efforts have primarily focused on optimizing ML Operations (MLOps) for a specific workflow engine, current methods largely overlook workflow optimization across different engines. In this work, we design and implement Couler, a system for unified ML workflow optimization in the cloud. Our main insight lies in the ability to generate an ML workflow from natural language (NL) descriptions. We integrate Large Language Models (LLMs) into workflow generation and provide a unified programming interface for various workflow engines. This approach reduces the need to understand each workflow engine's API. Moreover, Couler improves workflow computation efficiency by introducing automated caching at multiple stages, auto-parallelization of large workflows, and automatic hyperparameter tuning. These enhancements minimize redundant computational costs and improve fault tolerance during deep learning workflow training. Couler is extensively deployed in real-world production scenarios at Ant Group, handling approximately 22k workflows daily, and has improved CPU/memory utilization by more than 15% and the workflow completion rate by around 17%.
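To make the idea of step-level caching concrete, the following is a minimal sketch of how redundant computation in a workflow might be avoided. It is an illustration only and not Couler's actual API or caching mechanism: the `StepCache` class, its `run` method, and the `preprocess` step are hypothetical names, and the cache is assumed to be a simple in-memory dictionary keyed by a hash of the step name and its JSON-serializable inputs.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class StepCache:
    """Hypothetical in-memory cache keyed by step name plus a hash of its inputs."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(step_name: str, inputs: Dict[str, Any]) -> str:
        # Assumption: inputs are JSON-serializable; sort_keys makes the key
        # independent of keyword-argument order.
        payload = json.dumps({"step": step_name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, step_name: str, fn: Callable[..., Any], **inputs: Any) -> Any:
        key = self._key(step_name, inputs)
        if key in self._store:
            # Same step with the same inputs was already computed: reuse it.
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = fn(**inputs)
        self._store[key] = result
        return result


def preprocess(rows: int) -> list:
    """Stand-in for an expensive workflow step."""
    return [i * 2 for i in range(rows)]


cache = StepCache()
first = cache.run("preprocess", preprocess, rows=3)   # computed
second = cache.run("preprocess", preprocess, rows=3)  # served from the cache
third = cache.run("preprocess", preprocess, rows=4)   # different input, recomputed
```

In a real deployment the cached artifacts would presumably live in shared storage rather than process memory, and the key would also need to reflect the step's code or container image version; both are left out of this sketch.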