AI agent inference is driving an inference-heavy datacenter future and exposing bottlenecks beyond compute, especially memory capacity, memory bandwidth, and high-speed interconnect. We introduce two metrics, Operational Intensity (OI) and Capacity Footprint (CF), that jointly explain regimes the classic roofline analysis misses, including the memory capacity wall. Across agentic workflows (chat, coding, web use, computer use) and base-model choices (GQA/MLA, MoE, quantization), OI and CF can shift dramatically, with long-context KV caches making decode highly memory-bound. These observations motivate disaggregated serving and system-level heterogeneity: specialized prefill and decode accelerators, broader scale-up networking, and decoupled compute and memory enabled by optical I/O. We further hypothesize that agent-hardware co-design, multiple inference accelerators within one system, and high-bandwidth, large-capacity memory disaggregation will serve as foundations for adapting to evolving OI/CF. Together, these directions chart a path to sustaining efficiency and capability for large-scale agentic AI inference.
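To make the OI/CF intuition concrete, here is a minimal back-of-envelope sketch. The formulas are standard first-order approximations, not the paper's exact definitions, and the model configuration (70B parameters, 8 KV heads, head dimension 128, 80 layers, 128k-token context) is an illustrative assumption.

```python
# Back-of-envelope Operational Intensity (OI, FLOPs/byte) and
# Capacity Footprint (CF, bytes) for one autoregressive decode step.

def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """CF of the KV cache: one K and one V vector per token, per layer
    (fp16/bf16 by default)."""
    return 2 * seq_len * layers * kv_heads * head_dim * bytes_per_elem

def decode_oi(params, batch, bytes_per_weight=2):
    """OI of the weight matmuls at decode: each step streams all weights
    from memory once while doing ~2*params FLOPs per sequence in the batch."""
    flops = 2 * params * batch
    bytes_moved = params * bytes_per_weight
    return flops / bytes_moved

# Hypothetical 70B-parameter GQA model serving a 128k-token context, batch 1.
cf = kv_cache_bytes(seq_len=128_000, layers=80, kv_heads=8, head_dim=128)
oi = decode_oi(params=70e9, batch=1)
print(f"KV cache: {cf / 1e9:.1f} GB, decode OI: {oi:.1f} FLOPs/byte")
# -> KV cache: 41.9 GB, decode OI: 1.0 FLOPs/byte
```

With an OI of about 1 FLOP/byte at batch 1, decode sits far below the ridge point of modern accelerators (hundreds of FLOPs/byte), and a single long-context sequence already consumes tens of gigabytes of KV cache, which is the regime where bandwidth and capacity, not compute, are the walls.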