Running Large Language Models (LLMs) on-device is a critical enabler for preserving user privacy. We observe that, in state-of-the-art frameworks, the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of its sensitivity to quantization. This fallback degrades the user experience and complicates system scheduling. To this end, this paper presents shadowAttn, a system-algorithm co-designed sparse attention module that minimizes reliance on the CPU/GPU by computing attention on only a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens with an NPU-based pilot computation. Further, shadowAttn proposes techniques such as NPU compute graph bucketing, a head-wise NPU-CPU/GPU pipeline, and per-head fine-grained sparsity ratios to achieve high accuracy and efficiency. shadowAttn delivers the best performance when CPU/GPU resources are highly limited, and it requires far fewer CPU/GPU resources to match the performance of SoTA frameworks.
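
The pilot-then-sparse flow described in the abstract can be illustrated with a minimal sketch. This is not shadowAttn's implementation: it assumes symmetric int8 quantization as a stand-in for the low-precision NPU pilot, handles a single decode-step query per head, and omits graph bucketing and the NPU-CPU/GPU pipeline entirely. All function names (`quantize_int8`, `pilot_scores`, `sparse_attention_head`, `sparse_attention`) are hypothetical.

```python
import math


def quantize_int8(vec):
    """Symmetric per-vector int8 quantization (assumed stand-in for NPU math)."""
    scale = max(abs(x) for x in vec) / 127.0 or 1.0  # fall back to 1.0 for all-zero input
    return [round(x / scale) for x in vec], scale


def pilot_scores(q, keys):
    """Cheap estimate of q.k for every key, using quantized operands."""
    q_int, q_scale = quantize_int8(q)
    scores = []
    for k in keys:
        k_int, k_scale = quantize_int8(k)
        dot = sum(a * b for a, b in zip(q_int, k_int))
        scores.append(dot * q_scale * k_scale)
    return scores


def sparse_attention_head(q, keys, values, keep_ratio):
    """Full-precision softmax attention over only the tokens the pilot ranks highest."""
    n = len(keys)
    k_keep = max(1, math.ceil(keep_ratio * n))
    est = pilot_scores(q, keys)
    # Indices of the k_keep largest estimated scores, restored to token order.
    top = sorted(sorted(range(n), key=lambda i: est[i], reverse=True)[:k_keep])
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in top]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, top)) / total
           for d in range(dim)]
    return out, top


def sparse_attention(heads, keep_ratios):
    """Apply a separate sparsity ratio to each head; heads is a list of (q, keys, values)."""
    return [sparse_attention_head(q, k, v, r)[0]
            for (q, k, v), r in zip(heads, keep_ratios)]


if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    out, kept = sparse_attention_head(q, keys, values, keep_ratio=0.5)
    print("kept tokens:", kept)
    print("output:", out)
```

In this sketch the expensive full-precision work scales with the number of kept tokens rather than the sequence length, which is the property that lets the high-precision processor do less; the per-head `keep_ratios` list mirrors the idea of assigning each head its own sparsity ratio.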