Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform

We present Rhino, a system that accelerates tensor programs through automatic parallelization on an AI platform in a real production environment. It transforms a tensor program written for a single device into an equivalent distributed program that can scale up to thousands of devices with no user configuration. Rhino works on a semantically independent intermediate representation of tensor programs, which facilitates its generalization to previously unseen applications. It also implements a task-oriented controller and a distributed runtime for optimal performance. Rhino explores a complete and systematic parallelization strategy space that comprises all the paradigms commonly employed in deep learning (DL), in addition to strided partitioning and pipeline parallelism on non-linear models. To search efficiently for a near-optimal parallel execution plan, we analyze production clusters and derive general heuristics that speed up the strategy search. On top of this, two optimization levels offer users flexible trade-offs between search time and strategy quality. Our experiments demonstrate that Rhino can not only rediscover the expert-crafted strategies of classic, research, and production DL models, but also identify novel parallelization strategies that surpass existing systems for novel models.
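
To make the idea of an "equivalent distributed program" concrete, here is a minimal pure-Python sketch. It is not Rhino's implementation or API: the function names (`shard_rows`, `parallel_matmul`) are hypothetical, the "devices" are simulated in a loop, and the strided layout shown is a generic interpretation of strided partitioning, which the abstract does not define. The sketch only illustrates that a matrix product partitioned along its row (batch) dimension, contiguously or with a stride, gives the same result as the single-device program.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain single-device matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def shard_rows(num_rows: int, num_devices: int, strided: bool) -> List[List[int]]:
    """Assign row indices to devices, contiguously or with a stride (assumed layouts)."""
    if strided:
        # Device d owns rows d, d + num_devices, d + 2 * num_devices, ...
        return [list(range(d, num_rows, num_devices)) for d in range(num_devices)]
    chunk = -(-num_rows // num_devices)  # ceiling division
    return [
        list(range(d * chunk, min((d + 1) * chunk, num_rows)))
        for d in range(num_devices)
    ]


def parallel_matmul(a: Matrix, b: Matrix, num_devices: int, strided: bool = False) -> Matrix:
    """Simulate row partitioning: each 'device' multiplies its rows of a by the full b."""
    out: Matrix = [[] for _ in a]
    for rows in shard_rows(len(a), num_devices, strided):
        local = matmul([a[i] for i in rows], b)  # per-device computation
        for i, result_row in zip(rows, local):   # gather back in original row order
            out[i] = result_row
    return out


a = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
b = [[1, 0, 2], [0, 1, 3]]
single_device = matmul(a, b)
contiguous = parallel_matmul(a, b, num_devices=2)
strided = parallel_matmul(a, b, num_devices=3, strided=True)
```

Both partitioned results equal `single_device`. A real system would also have to weigh communication cost and memory per device when choosing among such layouts, which is the role the abstract assigns to the strategy search.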
