Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce \textbf{Temporal Attention Pattern Predictability Analysis (TAPPA)}, a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as either predictable patterns with clear regularities or unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension. Focusing on the predictable patterns, we provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning tasks. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at https://github.com/MIRALab-USTC/LLM-TAPPA.