We present Rhino, a system for accelerating tensor programs through automatic parallelization on an AI platform in a real production environment. It transforms a tensor program written for a single device into an equivalent distributed program that can scale to thousands of devices with no user configuration. Rhino operates on a semantically independent intermediate representation of tensor programs, which lets it generalize to previously unseen applications. It also implements a task-oriented controller and a distributed runtime for optimal performance. Rhino explores a complete and systematic parallelization strategy space that comprises all the paradigms commonly employed in deep learning (DL), as well as strided partitioning and pipeline parallelism on non-linear models. To search efficiently for a near-optimal parallel execution plan, we analyze production clusters and derive general heuristics that speed up the strategy search. On top of these heuristics, Rhino provides two optimization levels that let users trade off search time against strategy quality. Our experiments demonstrate that Rhino not only rediscovers the expert-crafted strategies for classic, research, and production DL models, but also identifies novel parallelization strategies that outperform those of existing systems on novel models.