Conventional LLM inference architectures suffer from high energy consumption and latency caused by frequent data movement across the memory hierarchy. We propose Ouroboros, a wafer-scale SRAM-based Computing-in-Memory (CIM) architecture that executes all operations in situ, eliminating off-chip data migration. To make the most of its limited first-level memory capacity, we introduce three innovations: (1) Token-Grained Pipelining, which replaces sequence-level pipelining to tolerate sequence-length variation, boosting utilization and reducing activation storage; (2) Distributed Dynamic KV Cache Management, which decouples memory from compute to exploit fragmented SRAM for efficient KV storage; and (3) Communication-Aware Mapping, which optimizes core allocation for locality and fault tolerance across the wafer. Experimental results show that Ouroboros achieves average gains of $4.1\times$ in throughput and $4.2\times$ in energy efficiency, peaking at $9.1\times$ and $17\times$ on the 13B model. (*Because arXiv limits the Abstract field to 1,920 characters, this abstract is abridged; please download the article for the full version.)
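To make the utilization argument behind token-grained pipelining concrete, the following is a minimal Python sketch, not the paper's implementation: the 4-stage pipeline, the unit per-token stage cost, and the sequence lengths are illustrative assumptions, and both function names are hypothetical. It contrasts a sequence-level schedule, where an entire sequence must clear a stage before the next sequence enters, with a token-grained schedule, where individual tokens stream through the stages back-to-back.

```python
# Toy model (assumption, not the paper's design): each token costs
# 1 cycle per stage; we compare the makespan of the two schedules.

def sequence_level_cycles(seq_lens, stages):
    """Sequence-level pipelining: stage s accepts sequence j only
    after j clears stage s-1 AND the previous sequence clears stage
    s, so short sequences stall behind long ones (pipeline bubbles)."""
    finish = [0] * stages          # time each stage becomes free
    makespan = 0
    for n in seq_lens:             # n tokens = n cycles per stage
        t = 0
        for s in range(stages):
            t = max(t, finish[s]) + n
            finish[s] = t
        makespan = max(makespan, t)
    return makespan

def token_grained_cycles(seq_lens, stages):
    """Token-grained pipelining: tokens from all sequences stream
    contiguously, so in steady state every stage stays busy and the
    makespan is (total tokens) + (pipeline fill latency)."""
    return sum(seq_lens) + (stages - 1)

if __name__ == "__main__":
    lens, stages = [3, 17, 5, 12, 4, 20], 4   # assumed workload
    seq = sequence_level_cycles(lens, stages)
    tok = token_grained_cycles(lens, stages)
    print(f"sequence-level: {seq} cycles, token-grained: {tok} cycles")
    print(f"utilization gain: {seq / tok:.2f}x")
```

Under these assumptions the sequence-level schedule finishes in 121 cycles versus 64 for the token-grained one (about 1.9x), purely because short sequences no longer wait behind long ones; the gains reported in the abstract come from the real architecture, not this toy model.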