The rise of generative AI workloads, particularly language model inference, is intensifying pressure on both on-chip and off-chip memory. Multimodal inputs such as video streams or images, and downstream applications such as Question Answering (QA) and analysis over large documents, lead to long contexts, which require caching large Key and Value (KV) states for all previous tokens. Even a low degree of concurrent inference serving on resource-constrained devices, such as mobile phones, further adds to memory capacity pressure and to the complexity of runtime memory management. In this paper, we use a hierarchical roofline-based analytical performance model to evaluate the performance implications of two emerging technologies that aim to alleviate memory pressure in terms of both capacity and bandwidth. For large models (e.g., 13B parameters) and long context lengths, we investigate the performance implications of High Bandwidth Storage (HBS) and outline the bandwidth and latency requirements for achieving a throughput acceptable for interactive use. For small models (e.g., 1B parameters), we evaluate the merit of a bonded global buffer memory chiplet and propose how best to utilize it.
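To make the capacity and bandwidth pressure concrete, the following is a minimal sketch of the kind of arithmetic involved: a KV-cache footprint estimate and a simplified multi-level roofline bound on per-token decode time. It is not the performance model used in this paper. The model shape (40 layers, 40 KV heads, head dimension 128, FP16) is an assumed configuration typical of a 13B-parameter decoder, and the names `ModelShape`, `kv_cache_bytes`, and `decode_time_s` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    """Assumed decoder shape; values are illustrative, not taken from the paper."""
    n_layers: int
    n_kv_heads: int
    head_dim: int


def kv_cache_bytes(m: ModelShape, context_len: int, batch: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache Key and Value states for all previous tokens.

    The leading factor of 2 accounts for storing both K and V per layer.
    """
    per_token = 2 * m.n_layers * m.n_kv_heads * m.head_dim * bytes_per_elem
    return per_token * context_len * batch


def decode_time_s(flops: float, peak_flops: float,
                  traffic: list[tuple[float, float]]) -> float:
    """Simplified multi-level roofline bound on the time for one decode step.

    `traffic` holds one (bytes_moved, bandwidth_bytes_per_s) pair per memory
    level. The step is bounded by the slowest of compute and each level's
    transfer time, assuming perfect overlap between levels (an assumption).
    """
    compute_time = flops / peak_flops
    memory_time = max(nbytes / bw for nbytes, bw in traffic)
    return max(compute_time, memory_time)


# Assumed 13B-class shape at a 32K-token context in FP16.
shape_13b = ModelShape(n_layers=40, n_kv_heads=40, head_dim=128)
kv_bytes = kv_cache_bytes(shape_13b, context_len=32 * 1024)
kv_gib = kv_bytes / 2**30
```

Under these assumptions, the cache alone is far larger than the memory of a typical mobile device, which is why the bandwidth of whichever tier holds the KV states can dominate decode time. Passing the weight traffic and the KV traffic as separate entries in `traffic`, each with its own hypothetical bandwidth, shows which level becomes the bottleneck.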