Along with the rapid evolution of deep neural networks (DNNs), hardware systems are also developing quickly. Multi-accelerator systems, a promising way to achieve high scalability at low manufacturing cost, are widely deployed in data centers, cloud platforms, and SoCs. This raises a challenging problem for such systems: selecting a proper combination of accelerators from the available designs and searching for efficient DNN mapping strategies. To this end, we propose MARS, a novel mapping framework that performs computation-aware accelerator selection and applies communication-aware sharding strategies to maximize parallelism. Experimental results show that MARS achieves a 32.2% latency reduction on average for typical DNN workloads compared to the baseline, and a 59.4% latency reduction on heterogeneous models compared to the corresponding state-of-the-art method.