Efficient deployment of Deep Neural Networks (DNNs), such as Large Language Models (LLMs), on tensor accelerators is essential for maximizing computational efficiency in modern AI systems. However, achieving this is challenging due to the enormous and complex design space created by the interaction of intra-layer mapping and inter-layer fusion. In this work, we present FADiff, a gradient-based optimization framework that automatically identifies high-quality intra-layer mapping and inter-layer fusion strategies to accelerate inference for DNN workloads. We first construct a unified, differentiable analytical cost model that accurately predicts the energy and latency of both single-layer mappings and various layer fusion strategies. Then, by encoding the discrete constraints into the loss function, we employ a gradient-based approach to efficiently explore the vast design space and determine the optimal joint mapping and fusion strategy. Experimental results demonstrate that FADiff outperforms existing methods, achieving lower energy consumption and latency.
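To illustrate the kind of optimization the abstract describes, below is a minimal sketch of relaxing one discrete mapping choice (a single tiling factor) into a differentiable surrogate and descending an analytical cost with a penalty that nudges the solution back toward a discrete point. The candidate set, the toy cost terms, and the entropy penalty are illustrative assumptions for exposition only, not FADiff's actual cost model or constraint encoding.

```python
import torch

# Hypothetical sketch of gradient-based mapping search (illustrative only):
# a softmax relaxation over discrete tiling-factor candidates, a toy
# differentiable cost model, and a penalty encouraging a discrete choice.

# Candidate tiling factors for one loop dimension of a single layer.
tile_candidates = torch.tensor([1., 2., 4., 8., 16., 32.])

# Learnable logits define a soft (differentiable) choice over the candidates.
logits = torch.zeros(len(tile_candidates), requires_grad=True)

def soft_choice(logits, candidates, temperature=1.0):
    """Softmax relaxation: a differentiable surrogate for picking one candidate."""
    weights = torch.softmax(logits / temperature, dim=0)
    return (weights * candidates).sum(), weights

def analytical_cost(tile):
    """Toy differentiable cost: a latency/energy trade-off versus tile size."""
    compute = 1024.0 / tile      # larger tiles amortize per-tile overhead
    traffic = 0.05 * tile ** 2   # but increase on-chip buffer traffic/energy
    return compute + traffic

def constraint_penalty(weights):
    """Entropy of the soft choice; penalizing it favors a near-discrete solution."""
    return -(weights * torch.log(weights + 1e-9)).sum()

opt = torch.optim.Adam([logits], lr=0.1)
for step in range(300):
    tile, weights = soft_choice(logits, tile_candidates)
    loss = analytical_cost(tile) + 0.1 * constraint_penalty(weights)
    opt.zero_grad()
    loss.backward()
    opt.step()

# A hard argmax at the end recovers a valid discrete mapping choice.
best = tile_candidates[logits.argmax()]
print(f"selected tile size: {best.item():.0f}")
```

In this toy setting the relaxed optimum sits near a tile size of 16, so the final hard rounding recovers the best discrete candidate; FADiff's actual formulation jointly handles many such mapping and fusion decisions under its own constraint encoding.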