Deploying Large Language Models (LLMs) on edge devices such as PCs enables low-latency inference with strong privacy guarantees, but long-context inference is fundamentally constrained by limited memory and compute resources. Beyond the model parameters, the KV cache becomes the dominant bottleneck because it grows linearly with context length. Although prior work exploits contextual sparsity to evict unimportant KV data, such approaches are largely designed for memory-rich platforms and incur prohibitive data transfer overhead when applied to resource-constrained edge devices with external storage. In this paper, we propose HillInfer, an importance-aware long-context LLM inference framework for the edge that leverages SmartSSD-assisted hierarchical KV cache management. HillInfer jointly manages KV cache pools across the CPU and the SmartSSD, and performs in-storage importance evaluation to reduce unnecessary data movement. Furthermore, we design an adaptive, prefetch-based pipeline that overlaps computation with KV data transfer across the GPU, CPU, and SmartSSD, minimizing end-to-end inference latency without sacrificing accuracy. We implement HillInfer on a PC with a commodity GPU; experiments across multiple models and benchmarks demonstrate up to an 8.56$\times$ speedup over baselines while preserving model accuracy.
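To make the linear-growth claim concrete, a back-of-the-envelope sizing follows; the configuration (a Llama-2-7B-like model with $L=32$ layers, $H=32$ KV heads, head dimension $d_h=128$, and fp16 entries, $b=2$ bytes) is an illustrative assumption, not a figure from the paper. The total KV cache footprint at context length $n$ is
\[
  S_{\mathrm{KV}} \;=\; 2 \cdot L \cdot H \cdot d_h \cdot n \cdot b,
\]
where the leading factor of $2$ accounts for the separate key and value tensors. Under the assumed configuration, $S_{\mathrm{KV}} = 2 \cdot 32 \cdot 32 \cdot 128 \cdot n \cdot 2$ bytes, i.e., $2$ GiB at $n = 4096$ tokens but $64$ GiB at $n = 131072$ tokens, dwarfing the roughly $13$ GiB of fp16 model weights and far exceeding the memory of a commodity edge GPU.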