Increasingly complex and diverse deep neural network (DNN) models must be executed across multiple devices for both training and inference, and they need carefully planned schedules to perform well. However, existing practice often relies on predefined schedules that may not fully exploit the benefits of emerging, diverse, model-aware operator placement strategies. Handcrafting efficient schedules is challenging because the schedule space is large and varies with the placement strategy. This paper presents Tessel, an automated system that searches for efficient schedules for distributed DNN training and inference across diverse operator placement strategies. To reduce search cost, Tessel leverages the insight that the most efficient schedules often exhibit a repetitive pattern (a repetend) across different data inputs. This leads to a two-phase approach: repetend construction followed by schedule completion. By exploring schedules for various operator placement strategies, Tessel significantly improves both training and inference performance. Experiments with representative DNN models demonstrate that Tessel achieves up to a 5.5x training speedup and up to a 38% reduction in inference latency.
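To make the repetend idea concrete, the following Python sketch tiles a small repeating pattern across micro-batches and checks that the resulting schedule is valid. It is only an illustration under strong assumptions, not Tessel's algorithm: the pipeline is forward-only, every block takes one time unit, the repetend is written by hand rather than searched for, and warm-up and cool-down are obtained by simply truncating partial repeats. All names (`Block`, `pipeline_repetend`, `tile`, `is_valid`) are hypothetical.

```python
from collections import namedtuple

# Hypothetical block description: which device runs which stage, and how the
# micro-batch index and start time shift relative to the repeat index r.
Block = namedtuple("Block", ["device", "stage", "mb_offset", "t_offset"])


def pipeline_repetend(num_stages):
    """Hand-written repetend for a forward-only pipeline (assumes unit-time blocks).

    In repeat r, device s runs stage s on micro-batch r - s, so each device
    works on a different micro-batch. Returns (blocks, period).
    """
    blocks = [Block(device=s, stage=s, mb_offset=-s, t_offset=0)
              for s in range(num_stages)]
    return blocks, 1


def tile(blocks, period, num_microbatches):
    """Repeat the repetend, dropping blocks whose micro-batch is out of range.

    The partial repeats at the start and end play the role of warm-up and
    cool-down. Returns a sorted list of (time, device, stage, microbatch).
    """
    offsets = [b.mb_offset for b in blocks]
    schedule = []
    for r in range(-max(offsets), num_microbatches - min(offsets)):
        for b in blocks:
            mb = r + b.mb_offset
            if 0 <= mb < num_microbatches:
                schedule.append((r * period + b.t_offset, b.device, b.stage, mb))
    if not schedule:
        return []
    t0 = min(entry[0] for entry in schedule)  # shift so the schedule starts at 0
    return sorted((t - t0, d, s, mb) for t, d, s, mb in schedule)


def is_valid(schedule, num_stages, num_microbatches):
    """Check coverage, device exclusivity, and stage ordering (unit-time blocks)."""
    start = {(s, mb): t for t, _, s, mb in schedule}
    expected = {(s, mb) for s in range(num_stages) for mb in range(num_microbatches)}
    # Every (stage, micro-batch) pair must run exactly once.
    if len(schedule) != len(start) or set(start) != expected:
        return False
    # A device may run at most one block per time step.
    slots = {(t, d) for t, d, _, _ in schedule}
    if len(slots) != len(schedule):
        return False
    # Stage s of a micro-batch must start after stage s - 1 of the same micro-batch.
    return all(start[(s, mb)] > start[(s - 1, mb)] for s, mb in start if s > 0)


if __name__ == "__main__":
    blocks, period = pipeline_repetend(num_stages=3)
    schedule = tile(blocks, period, num_microbatches=5)
    for t, d, s, mb in schedule:
        print(f"t={t} device={d} stage={s} microbatch={mb}")
    print("valid:", is_valid(schedule, 3, 5))
```

With 3 stages and 5 micro-batches, the tiled schedule spans 7 time steps. Steps 2 through 4 are full copies of the repetend, while the earlier and later steps are the truncated warm-up and cool-down. A real search would also have to handle backward passes, non-uniform block durations, and memory limits, none of which this sketch models.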