Linear Attention Large Language Models (LLMs) offer a compelling recurrent formulation that compresses the context into a fixed-size state matrix, enabling constant-time inference. However, the internal dynamics of this compressed state remain largely opaque. In this work, we present a comprehensive study of the runtime state dynamics of state-of-the-art Linear Attention models. We uncover a fundamental phenomenon, termed State Rank Stratification, characterized by a distinct spectral bifurcation among linear attention heads: one group maintains an effective rank that oscillates near zero, while the other exhibits rapid growth that converges to an upper bound. Extensive experiments across diverse inference contexts reveal that these dynamics remain strikingly consistent, indicating that the identity of a head, whether low-rank or high-rank, is an intrinsic structural property acquired during pre-training rather than a transient state dependent on the input data. Furthermore, our diagnostic probes reveal a surprising functional divergence: low-rank heads are indispensable for model reasoning, whereas high-rank heads exhibit significant redundancy. Leveraging this insight, we propose Joint Rank-Norm Pruning, a zero-shot strategy that reduces KV-cache overhead by 38.9\% while largely maintaining model accuracy.
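As a point of reference, the following is a minimal sketch (not the paper's code) of the kind of runtime diagnostic the abstract describes: tracking the effective rank of a linear-attention state matrix $S_t = S_{t-1} + v_t k_t^\top$ as tokens stream in. The entropy-based definition of effective rank is assumed here; the paper may use a different estimator, and the head dimensions, sequence length, and random key/value streams are illustrative placeholders.

\begin{verbatim}
import numpy as np

def effective_rank(S: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution of S."""
    s = np.linalg.svd(S, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy))

# Illustrative head dimensions and context length (assumptions, not the
# paper's settings).
d_k, d_v, seq_len = 64, 64, 512
rng = np.random.default_rng(0)

S = np.zeros((d_v, d_k))          # fixed-size recurrent state of one head
ranks = []
for t in range(seq_len):
    k = rng.standard_normal(d_k)  # stand-in for the key at step t
    v = rng.standard_normal(d_v)  # stand-in for the value at step t
    S += np.outer(v, k)           # rank-1 state update of linear attention
    ranks.append(effective_rank(S))

print(f"effective rank after {seq_len} tokens: {ranks[-1]:.1f}")
\end{verbatim}

Plotting such a trajectory per head over real inference inputs is one way the low-rank versus high-rank stratification described above could be observed.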