The emergence of large language models~(LLMs) has driven exponential growth in demand for computation throughput, memory capacity, and communication bandwidth. This growth has significantly outpaced improvements in the corresponding chip designs. With advances in fabrication and integration technologies, designers have been developing Wafer-Scale Chips~(WSCs) to push computation density, memory capacity, and communication bandwidth to their limits at the level of a single chip. Existing solutions have demonstrated significant advantages of WSCs over traditional designs, showing their potential to effectively support LLM workloads. Despite these benefits, exploring the early-stage design space of WSCs for LLMs is a crucial yet challenging task because of the enormous and complicated design space, time-consuming evaluation methods, and inefficient exploration strategies. To address these challenges, we propose Theseus, an efficient WSC design space exploration framework for LLMs. We construct the design space of WSCs under various constraints that reflect the unique characteristics of WSCs. We propose efficient evaluation methodologies for large-scale NoC-based WSCs and introduce multi-fidelity Bayesian optimization to explore the design space efficiently. Evaluation results demonstrate the efficiency of Theseus: the Pareto-optimal designs it finds outperform a GPU cluster and existing WSC designs by up to 62.8%/73.7% in performance and 38.6%/42.4% in power consumption for LLM training, while improving the performance and power of inference tasks by up to 23.2$\times$ and 15.7$\times$. Furthermore, we conduct case studies on the design tradeoffs in WSCs and provide insights to facilitate WSC designs for LLMs.