Large language models (LLMs) such as OpenAI's ChatGPT and Google's Gemini have demonstrated the unprecedented capabilities of autoregressive AI models across a wide range of tasks, triggering disruptive technology innovations around the world. However, as models continue to grow, the cost of serving them also continues to grow, threatening the democratization of LLMs. To address this issue, we propose Chiplet Cloud, a chiplet-based ASIC LLM-supercomputer architecture that optimizes the total cost of ownership (TCO) per generated token. Chiplet Cloud is a highly parameterizable ASIC and server-level architecture that leverages thousands of replicated accelerator modules collaborating to scale up LLM performance at cloud scale. To determine specific parameterizations of the Chiplet Cloud architecture, we implement a two-phase hardware-software co-design methodology that searches the massive design space and fine-tunes the architecture across a collection of LLMs based on accurate inference simulation. Because memory access performance is a common bottleneck for LLMs, we introduce CC-MEM, a scalable on-chip memory system for Chiplet Cloud architectures. Using CC-MEM, Chiplet Clouds can be built entirely from SRAM for design points where the power and performance of memory access are critical. CC-MEM also includes a compression decoder module that supports sparse models without impacting the compute units, using a Store-as-Compressed, Load-as-Dense mechanism. We evaluate Chiplet Cloud architectures across eight popular LLMs. Fine-tuned Chiplet Cloud servers achieve $97\times$ and $18\times$ improvements in TCO/Token over rented GPU and TPU clouds, and $8.3\times$ and $3.7\times$ improvements over fabricated GPU and TPU clouds, respectively. Chiplet Cloud can also support $1.7\times$ larger models at a sparsity of 60\%.
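The Store-as-Compressed, Load-as-Dense idea can be illustrated with a minimal sketch. Here weights are held in compressed form (nonzero values plus a presence bitmask, an assumed encoding for illustration; the paper's actual CC-MEM format may differ) and are expanded back to dense form only when loaded, so the compute units always see dense operands:

```python
def store_as_compressed(dense):
    """Store step: keep only nonzero values plus a presence bitmask.

    This bitmask encoding is a hypothetical stand-in for the paper's
    compressed format; it illustrates the storage savings from sparsity.
    """
    mask = [x != 0.0 for x in dense]
    values = [x for x in dense if x != 0.0]
    return values, mask


def load_as_dense(values, mask):
    """Load step: the decoder reinserts zeros so downstream compute
    units receive an ordinary dense vector and need no sparse logic."""
    it = iter(values)
    return [next(it) if present else 0.0 for present in mask]


# Round trip on a weight row with 60% sparsity (6 of 10 entries zero):
# only 4 values are stored, plus 1 bit per position for the mask.
row = [0.5, 0.0, 0.0, 1.25, 0.0, 0.0, -2.0, 0.0, 0.0, 3.0]
values, mask = store_as_compressed(row)
restored = load_as_dense(values, mask)
```

With 60% of the weights zero, the stored payload shrinks to roughly 40% of the dense size (plus the mask), which is consistent with the abstract's claim of fitting $1.7\times$ larger models at that sparsity.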