The computational and memory demands of generative inference with large language models pose significant challenges for practical deployment. One promising solution for offline inference is offloading-based batched inference, which extends the GPU's memory hierarchy with host memory and storage. However, it often suffers from substantial I/O overhead, primarily due to large KV cache sizes that scale with batch size and context window length. In this paper, we introduce HILOS, a framework that boosts offline inference throughput using near-storage processing (NSP). The core of HILOS is attention near storage, which offloads memory-intensive attention operations to near-storage accelerators, reducing traffic across the system interconnect. Building on attention near storage, HILOS incorporates three additional optimizations. First, cooperative X-cache minimizes KV cache I/O by exploiting the host resources freed after offloading. Second, delayed KV cache writeback hides storage write latency and mitigates storage write amplification. Finally, a memory-efficient attention accelerator sustains high throughput for long sequences within the resource constraints of NSP devices. We implemented and evaluated HILOS on a real system equipped with 16 SmartSSDs. Compared to state-of-the-art offloading-based inference frameworks, HILOS achieves up to 7.86x higher throughput while reducing energy consumption by up to 85\%. The source code for HILOS is available at https://github.com/hongsunjang/HILOS.