Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV) caches. We present SparKV, an adaptive KV loading framework that combines cloud-based KV streaming with on-device computation. SparKV models the cost of individual KV chunks and decides whether each chunk should be streamed or computed locally, while overlapping the two execution paths to reduce latency. To handle fluctuations in wireless connectivity and edge resource availability, SparKV further refines offline-generated schedules at runtime to rebalance communication and computation costs. Experiments across diverse datasets, LLMs, and edge devices show that SparKV reduces Time-to-First-Token by 1.3x to 5.1x with negligible impact on response quality, while lowering per-request energy consumption by 1.5x to 3.3x, demonstrating its robustness and practicality for real-world on-device deployment.
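The per-chunk stream-or-compute decision described above can be illustrated with a minimal sketch. It is not SparKV's actual scheduler: it assumes a simple linear cost model (streaming time proportional to KV size, compute time proportional to token count), uses plain greedy list scheduling over two parallel paths, and ignores dependencies between chunks. The names `Chunk`, `plan`, `bandwidth_bps`, and `tokens_per_s` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    tokens: int      # context tokens covered by this KV chunk (assumed cost driver for compute)
    kv_bytes: int    # size of the chunk's KV cache when streamed (assumed cost driver for network)


def plan(chunks, bandwidth_bps, tokens_per_s):
    """Greedy assignment of chunks to two overlapping paths.

    Each chunk goes to whichever path (network streaming or local
    compute) would finish it earlier, given the work already queued
    on that path. Returns the decisions and the estimated makespan.
    """
    net_busy = 0.0   # time at which the network link becomes free
    dev_busy = 0.0   # time at which the local device becomes free
    decisions = []
    for c in chunks:
        stream_done = net_busy + c.kv_bytes * 8 / bandwidth_bps
        compute_done = dev_busy + c.tokens / tokens_per_s
        if stream_done <= compute_done:
            net_busy = stream_done
            decisions.append("stream")
        else:
            dev_busy = compute_done
            decisions.append("compute")
    # The two paths run concurrently, so the slower one sets the latency.
    return decisions, max(net_busy, dev_busy)
```

Under these assumptions, a runtime refinement step could be approximated by calling `plan` again on the chunks not yet processed, using freshly measured values of `bandwidth_bps` and `tokens_per_s`.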