LLMs now form the backbone of AI agents across a diverse range of applications, including tool use, command-line interfaces, and web or computer interaction. These agentic LLM inference tasks are fundamentally different from chatbot-focused inference. They often involve much longer context lengths to capture complex and prolonged inputs, such as an entire webpage DOM or complicated tool-call trajectories. This, in turn, generates significant off-chip memory traffic during inference and causes workloads to be constrained by two memory walls, namely the bandwidth wall and the capacity wall, preventing compute units from achieving high utilization. In this paper, we introduce PLENA, a hardware-software co-designed system built around three core optimization pathways. PLENA features a novel flattened systolic-array architecture (Pathway 1) and efficient compute and memory units that support an asymmetric quantization scheme (Pathway 2). It also provides native support for FlashAttention (Pathway 3). In addition, PLENA includes a complete software-hardware stack, consisting of a custom ISA, a compiler, a transaction-level simulator, and an automated design-space exploration flow. Experimental results show that PLENA delivers up to 2.23x and 4.70x higher throughput than the A100 GPU and TPU v6e, respectively, under identical multiplier counts and memory configurations during LLaMA agentic inference. PLENA also achieves up to 4.04x higher energy efficiency than the A100 GPU. The full PLENA system, including its simulator, compiler, ISA, and RTL implementation, will be open-sourced to the research community.
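To make the capacity and bandwidth walls concrete, the following back-of-the-envelope sketch estimates the key-value (KV) cache footprint of a long-context request and the minimum time needed to stream it from off-chip memory once per decoded token. It is only an illustration and not part of PLENA: the function names, the model dimensions (32 layers, 8 KV heads, head dimension 128), the 128K-token context, and the round 2 TB/s bandwidth figure are all assumptions chosen for the example, and the 4-bit case stands in for any lower-precision storage format rather than PLENA's specific quantization scheme.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_elem, batch=1):
    """Bytes needed to hold keys and values for every layer and token."""
    # The factor 2 accounts for storing both a key and a value vector.
    total_bits = 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bits_per_elem
    return total_bits // 8


def min_step_seconds(bytes_read, bandwidth_bytes_per_s):
    """Lower bound on one decode step if the whole cache is read once."""
    return bytes_read / bandwidth_bytes_per_s


# Hypothetical configuration (assumed for illustration, not taken from the paper).
SEQ_LEN = 128 * 1024        # 128K-token context, e.g. a long tool-call trajectory
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BANDWIDTH = 2e12            # assumed 2 TB/s of off-chip memory bandwidth

fp16_bytes = kv_cache_bytes(SEQ_LEN, N_LAYERS, N_KV_HEADS, HEAD_DIM, bits_per_elem=16)
int4_bytes = kv_cache_bytes(SEQ_LEN, N_LAYERS, N_KV_HEADS, HEAD_DIM, bits_per_elem=4)

print(f"16-bit KV cache: {fp16_bytes / 2**30:.1f} GiB")
print(f" 4-bit KV cache: {int4_bytes / 2**30:.1f} GiB")
print(f"16-bit min step time: {min_step_seconds(fp16_bytes, BANDWIDTH) * 1e3:.2f} ms")
print(f" 4-bit min step time: {min_step_seconds(int4_bytes, BANDWIDTH) * 1e3:.2f} ms")
```

Under these assumptions a single 128K-token request occupies 16 GiB of KV cache at 16-bit precision, which must be both stored (capacity wall) and re-read on every decode step (bandwidth wall). Storing the same cache at 4 bits cuts both costs by a factor of four, which is the general motivation for pairing low-precision storage with the compute array.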