Training modern large neural networks (NNs) requires a combination of parallelization strategies encompassing data, model, or optimizer sharding. As these strategies grow in complexity, partitioning tools must be 1) expressive, allowing simpler strategies to be composed, and 2) predictable, so that performance can be estimated analytically. We present PartIR, our design for an NN partitioning system. PartIR focuses on an incremental approach to rewriting and is hardware- and runtime-agnostic. We present a simple but powerful API for composing sharding strategies and a simulator to validate them. The process is driven by high-level, programmer-issued partitioning tactics, which can be either manual or automatic. Importantly, the tactics are specified separately from the model code, making them easy to change. We evaluate PartIR on several different models to demonstrate its predictability, expressibility, and ability to reach peak performance.