Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV) caches. We present SparKV, an adaptive KV loading framework that combines cloud-based KV streaming with on-device computation. SparKV models the cost of individual KV chunks and decides whether each chunk should be streamed or computed locally, while overlapping the two execution paths to reduce latency. To handle fluctuations in wireless connectivity and edge resource availability, SparKV further refines offline-generated schedules at runtime to rebalance communication and computation costs. Experiments across diverse datasets, LLMs, and edge devices show that SparKV reduces Time-to-First-Token by 1.3x to 5.1x with negligible impact on response quality, while lowering per-request energy consumption by 1.5x to 3.3x, demonstrating its robustness and practicality for real-world on-device deployment.
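To make the per-chunk stream-or-compute decision concrete, the following is a minimal illustrative sketch, not SparKV's actual scheduler. It assumes each chunk has a known KV size and an estimated on-device compute time, treats the network and the local processor as two paths that run in parallel, and ignores dependencies between chunks. The names `Chunk`, `plan`, `kv_bytes`, `compute_s`, and `bandwidth_bps` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    kv_bytes: int      # assumed size of the chunk's KV cache if streamed
    compute_s: float   # assumed on-device prefill time for the chunk, in seconds


def plan(chunks, bandwidth_bps):
    """Greedily assign each chunk to the path that would finish it sooner.

    Returns (decisions, estimated_latency_s). Because the two paths overlap,
    the estimated latency is the busy time of the slower path.
    """
    net_busy = 0.0   # accumulated time on the streaming path
    dev_busy = 0.0   # accumulated time on the local compute path
    decisions = []
    for chunk in chunks:
        stream_s = chunk.kv_bytes * 8 / bandwidth_bps
        if net_busy + stream_s <= dev_busy + chunk.compute_s:
            net_busy += stream_s
            decisions.append("stream")
        else:
            dev_busy += chunk.compute_s
            decisions.append("compute")
    return decisions, max(net_busy, dev_busy)


chunks = [Chunk(kv_bytes=1_000_000, compute_s=0.5) for _ in range(3)]
decisions, latency = plan(chunks, bandwidth_bps=8e6)
print(decisions, latency)
```

Under these assumptions, runtime refinement can be approximated by calling `plan` again on the chunks that remain whenever the measured bandwidth or the device's compute estimate changes, so the split between the two paths follows current conditions.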