Deploying Large Language Models (LLMs) on memory-constrained AI Personal Computers (AIPCs) enables low-latency, privacy-preserving inference, but long-context generation is fundamentally bottlenecked by the linearly growing Key-Value (KV) cache. While dynamic KV eviction mitigates this memory wall, existing offloading strategies either trigger crippling PCIe I/O bottlenecks on standard SSDs or suffer from FPGA resource exhaustion by forcing compute-intensive exact attention on a single, weak Computational Storage Drive (CSD). In this paper, we propose HillInfer, a CSD-assisted KV eviction framework that introduces a paradigm shift: offloading strictly lightweight token importance evaluation to a single CSD (e.g., SmartSSD) on AIPCs. To fully capitalize on this lightweight offloading strategy, HillInfer orchestrates a Hierarchical KV Cache Manager (HKM) that leverages temporal locality and dynamic token hit rates to physically partition cache pools, thereby eliminating cross-device I/O thrashing. Additionally, we design an Adaptive Prefetch-based Pipeline (APP) that adaptively balances the evaluation workload between the host CPU and the SmartSSD, effectively masking the heterogeneous straggler effect. Finally, we introduce a CSD-based Evaluation Configuration (CEC) to enable resource-efficient near-data processing on the FPGA. Extensive experiments on a commodity AIPC demonstrate that HillInfer achieves up to an 8.56$\times$ speedup over state-of-the-art baselines, delivering low-latency, I/O-efficient long-context inference without sacrificing model accuracy.
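To make the "linearly growing KV cache" concrete, the following minimal sketch estimates the cache footprint as a function of context length. The default model shape (32 layers, 32 KV heads, head dimension 128, 16-bit elements) is an illustrative assumption and is not taken from the paper's experimental setup; `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=32, head_dim=128,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    All default values are illustrative assumptions, not figures from the paper.
    """
    # Each token stores one key and one value vector (factor of 2)
    # per layer and per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    # The footprint grows linearly with context length and batch size.
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    for n in (4096, 32768, 131072):
        print(f"{n:>7} tokens -> {kv_cache_bytes(n) / 2**30:.1f} GiB")
```

Under these assumed defaults each token costs 0.5 MiB, so a 4,096-token context already occupies 2 GiB, and the footprint doubles whenever the context length doubles. This is the growth that motivates evicting or offloading KV entries on memory-constrained devices.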