Emerging multi-model workloads that include heavy models, such as recent large language models, have significantly increased the compute and memory demands on hardware. To address these growing demands, designing a scalable hardware architecture has become a key problem. Among recent solutions, 2.5D silicon interposer multi-chip module (MCM)-based AI accelerators have been actively explored as a promising scalable solution because of their low engineering cost and composability. However, previous MCM accelerators are based on homogeneous architectures with a fixed dataflow, which struggle with highly heterogeneous multi-model workloads because of their limited workload adaptivity. Therefore, in this work, we explore the opportunity in heterogeneous dataflow MCM AI accelerators. We identify the scheduling of multi-model workloads on heterogeneous dataflow MCM AI accelerators as an important and challenging problem due to its significance and scale; the scheduling space reaches O(10^18) even for a single model on 6x6 chiplets. We develop a set of heuristics to navigate this huge scheduling space and codify them into a scheduler that incorporates advanced techniques such as inter-chiplet pipelining. Our evaluation on ten multi-model workload scenarios for datacenter multitenancy and AR/VR use cases shows the efficacy of our approach, which achieves on average 35.3% and 31.4% lower energy-delay product (EDP) for the respective application settings compared to homogeneous baselines.