Large language models (LLMs) such as ChatGPT have demonstrated unprecedented capabilities across many AI tasks. However, hardware inefficiency has become a significant factor limiting the democratization of LLMs. We propose Chiplet Cloud, an ASIC supercomputer architecture that optimizes total cost of ownership (TCO) per token for serving generative LLMs. Chiplet Cloud fits all model parameters in on-chip SRAM to eliminate memory bandwidth limitations, keeps die sizes moderate to reduce system cost, and leverages software mappings to overcome data communication overhead. We propose a comprehensive design methodology that accurately explores a spectrum of major design trade-offs in the joint hardware-software space and generates a detailed performance-cost analysis for all valid design points. We evaluate Chiplet Cloud on four popular LLMs. Compared to GPUs and TPUs, our architecture achieves up to 94x and 15x improvement in TCO/Token, respectively, significantly reducing the cost of realistically serving modern LLMs.
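To make the TCO/Token metric and the SRAM-capacity constraint concrete, the following is a minimal sketch under simplifying assumptions, not the cost model used in the evaluation. It assumes TCO is capital cost plus electricity cost over the deployment lifetime, and that throughput is constant. The function names, parameters, and example values are all hypothetical.

```python
import math


def chips_needed(num_params: int, bytes_per_param: int, sram_bytes_per_chip: int) -> int:
    """Minimum number of chips whose combined on-chip SRAM holds all parameters.

    Assumption: parameters can be split evenly across chips with no replication.
    """
    return math.ceil(num_params * bytes_per_param / sram_bytes_per_chip)


def tco_per_token(
    capex_usd: float,
    power_watts: float,
    usd_per_kwh: float,
    lifetime_years: float,
    tokens_per_second: float,
) -> float:
    """Simplified TCO/Token: (CapEx + energy OpEx) / tokens served over the lifetime.

    Assumption: the system runs at constant power and throughput for its whole
    lifetime; cooling, maintenance, and networking costs are ignored.
    """
    hours = lifetime_years * 365 * 24
    opex_usd = (power_watts / 1000.0) * hours * usd_per_kwh
    total_tokens = tokens_per_second * hours * 3600
    return (capex_usd + opex_usd) / total_tokens


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the functions.
    n_chips = chips_needed(
        num_params=1_000_000_000, bytes_per_param=2, sram_bytes_per_chip=100_000_000
    )
    cost = tco_per_token(
        capex_usd=50_000.0,
        power_watts=2_000.0,
        usd_per_kwh=0.10,
        lifetime_years=3.0,
        tokens_per_second=1_000.0,
    )
    print(f"chips needed: {n_chips}")
    print(f"TCO/Token (USD): {cost:.3e}")
```

Under this simplified model, TCO/Token falls linearly as throughput rises at fixed cost, which is why removing the memory bandwidth bottleneck and lowering per-chip cost both act directly on the metric.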