Large language models (LLMs) such as ChatGPT have demonstrated unprecedented capabilities across many AI tasks. However, hardware inefficiency has become a significant factor limiting the democratization of LLMs. We propose Chiplet Cloud, an ASIC supercomputer architecture that optimizes total cost of ownership (TCO) per token for serving generative LLMs. Chiplet Cloud fits all model parameters in on-chip SRAM to eliminate bandwidth limitations, moderates die size to reduce system cost, and leverages software mappings to overcome data communication overhead. We propose a comprehensive design methodology that accurately explores the major design trade-offs in the joint hardware-software space and generates a detailed performance-cost analysis for all valid design points. We evaluate Chiplet Cloud on four popular LLMs. Compared to GPU and TPU, our architecture achieves up to 94x and 15x improvement in TCO/Token, respectively, significantly reducing the cost of serving modern LLMs in realistic settings.
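To make the TCO/Token metric concrete, the sketch below shows one simple way such a figure could be computed. The abstract does not spell out its cost model, so the decomposition used here (up-front capital cost plus electricity cost over the deployment lifetime, divided by tokens served) is an assumption, and the function names, parameters, and any numbers passed in are hypothetical rather than values from the paper.

```python
HOURS_PER_YEAR = 365 * 24
SECONDS_PER_HOUR = 3600


def tco_per_token(capex_usd, power_watts, usd_per_kwh,
                  lifetime_years, tokens_per_second, utilization=1.0):
    """Illustrative TCO per token: (CapEx + energy OpEx) / tokens served.

    Assumed cost model, not the paper's: OpEx is electricity only, and
    throughput is constant over the whole lifetime.
    """
    hours = lifetime_years * HOURS_PER_YEAR
    # Energy cost over the lifetime, in USD.
    opex_usd = (power_watts / 1000.0) * hours * usd_per_kwh
    # Total tokens generated over the lifetime.
    tokens = tokens_per_second * utilization * hours * SECONDS_PER_HOUR
    if tokens <= 0:
        raise ValueError("system must serve a positive number of tokens")
    return (capex_usd + opex_usd) / tokens


def improvement(baseline_tco_per_token, candidate_tco_per_token):
    """Ratio by which the candidate is cheaper per token than the baseline."""
    return baseline_tco_per_token / candidate_tco_per_token
```

Under this reading, a reported "94x improvement in TCO/Token" corresponds to `improvement(gpu_cost, chiplet_cost)` returning 94 for some pair of per-token costs; a fuller model would also account for items such as cooling, networking, and datacenter overhead.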