Large language models (LLMs) such as ChatGPT have demonstrated unprecedented capabilities across many AI tasks. However, hardware inefficiency has become a significant factor limiting the democratization of LLMs. We propose Chiplet Cloud, an ASIC supercomputer architecture that optimizes total cost of ownership per token (TCO/Token) for serving generative LLMs. Chiplet Cloud fits all model parameters inside on-chip SRAM to eliminate memory bandwidth limitations, moderates die size to reduce system cost, and leverages software mappings to overcome data communication overhead. We propose a comprehensive design methodology that accurately explores a spectrum of major design trade-offs in the joint hardware-software space and generates a detailed performance-cost analysis of all valid design points. We evaluate Chiplet Cloud on four popular LLMs. Compared with GPUs and TPUs, our architecture achieves up to 94x and 15x improvement in TCO/Token, respectively, significantly reducing the cost of realistically serving modern LLMs.
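As a rough illustration of the TCO/Token metric, the sketch below assumes the common decomposition of TCO into capital expenditure plus operating expenditure over the system lifetime, divided by the number of tokens served. The class name, field names, and all dollar and throughput figures are hypothetical placeholders chosen for illustration; they are not taken from the paper's cost model or results.

```python
from dataclasses import dataclass

SECONDS_PER_YEAR = 365 * 24 * 3600


@dataclass
class ServingSystem:
    """Hypothetical description of an LLM serving system (illustrative only)."""
    capex_usd: float            # assumed up-front hardware cost
    opex_usd_per_year: float    # assumed yearly power, cooling, and hosting cost
    lifetime_years: float       # assumed deployment lifetime
    tokens_per_second: float    # assumed sustained generation throughput
    utilization: float = 1.0    # fraction of time spent serving requests

    def tco_usd(self) -> float:
        # TCO = CapEx + OpEx accumulated over the lifetime.
        return self.capex_usd + self.opex_usd_per_year * self.lifetime_years

    def tokens_served(self) -> float:
        # Total tokens generated over the lifetime at the given utilization.
        return (self.tokens_per_second * self.utilization
                * self.lifetime_years * SECONDS_PER_YEAR)

    def tco_per_token(self) -> float:
        return self.tco_usd() / self.tokens_served()


# Made-up numbers: a baseline system and a costlier but faster candidate.
baseline = ServingSystem(capex_usd=1_000_000, opex_usd_per_year=200_000,
                         lifetime_years=3, tokens_per_second=1_000)
candidate = ServingSystem(capex_usd=2_000_000, opex_usd_per_year=100_000,
                          lifetime_years=3, tokens_per_second=10_000)

# Improvement factor in TCO/Token of the candidate over the baseline.
improvement = baseline.tco_per_token() / candidate.tco_per_token()
print(f"TCO/Token improvement: {improvement:.2f}x")
```

Under this simplified model, a design with higher up-front cost can still win on TCO/Token when its throughput gain outweighs the added cost, which is the kind of trade-off a TCO/Token-driven design search has to weigh.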