Several methods exist today to accelerate Machine Learning (ML) or Deep Learning (DL) models for training and inference. However, modern techniques built on graph and operator parallelism depend on search-space optimizations, which are costly in terms of power and hardware usage. Especially for inference, when the batch size is 1 and execution is on CPUs or on power-constrained edge devices, current techniques can become costly, complicated, or inapplicable. To address this, we present a Critical-Path-based Linear Clustering approach that exploits the inherent parallel paths in ML dataflow graphs. Our task parallelization approach further optimizes the structure of graphs via cloning and prunes them via constant propagation and dead-code elimination. In contrast to other work, we generate readable and executable parallel PyTorch+Python code from input ML models in ONNX format via a new tool that we have built called {\bf Ramiel}. Generating code in this form lets us benefit from other downstream acceleration techniques such as intra-op parallelism and, potentially, pipeline parallelism. Our preliminary results on several ML graphs demonstrate up to 1.9$\times$ speedup over serial execution and show that our approach outperforms some current mechanisms in both compile time and runtime. Lastly, our methods are lightweight and fast enough to be used effectively on power- and resource-constrained devices, while still enabling downstream optimizations.
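To make the idea of critical-path-based linear clustering concrete, the following is a minimal sketch of the generic technique, not Ramiel's actual implementation. It assumes the dataflow graph is a DAG whose nodes are given in topological order with a positive per-node cost estimate; the function name `linear_clusters` and its parameters are hypothetical and chosen only for illustration. The sketch repeatedly peels the costliest remaining path off the graph and treats each path as one cluster that could run as a separate task.

```python
from collections import defaultdict


def linear_clusters(nodes, edges, cost):
    """Partition a DAG into linear clusters by repeatedly removing its critical path.

    nodes: node names, assumed to be listed in topological order.
    edges: (src, dst) pairs describing data dependencies.
    cost:  dict mapping each node to an assumed positive execution cost.
    """
    succs = defaultdict(list)
    for src, dst in edges:
        succs[src].append(dst)

    remaining = set(nodes)
    clusters = []
    while remaining:
        # For every remaining node, compute the costliest path that starts
        # there, walking the nodes in reverse topological order.
        best_cost = {}
        best_next = {}
        for n in reversed(nodes):
            if n not in remaining:
                continue
            best_cost[n] = cost[n]
            best_next[n] = None
            for s in succs[n]:
                if s in remaining and cost[n] + best_cost[s] > best_cost[n]:
                    best_cost[n] = cost[n] + best_cost[s]
                    best_next[n] = s

        # The critical path starts at the node with the largest path cost
        # (ties go to the node that comes first in topological order).
        start = max((n for n in nodes if n in remaining),
                    key=lambda n: best_cost[n])
        path = []
        cur = start
        while cur is not None:
            path.append(cur)
            cur = best_next[cur]

        clusters.append(path)
        remaining.difference_update(path)
    return clusters


if __name__ == "__main__":
    # Diamond-shaped graph: a feeds b and c, which both feed d.
    demo = linear_clusters(
        ["a", "b", "c", "d"],
        [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")],
        {"a": 1, "b": 3, "c": 2, "d": 1},
    )
    print(demo)  # [['a', 'b', 'd'], ['c']]
```

In this toy example the critical path `a, b, d` forms the first cluster and the off-path node `c` forms a second one, which is the kind of inherent parallel path the abstract refers to. A real tool would also have to handle the communication between clusters, which this sketch leaves out.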