We present Top-Theta (Top-$θ$) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated so that each attention row retains a desired, roughly constant number of significant elements. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques that preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide an extensive evaluation on natural language processing tasks, showing that Top-$θ$ achieves a 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference, with no more than a 1% drop in accuracy.
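The thresholding idea can be illustrated with a minimal NumPy sketch for a single head. It is not the authors' implementation: the function names `calibrate_threshold` and `top_theta_softmax`, the rule of averaging the per-row k-th largest score into one static threshold, the fallback that always keeps each row's maximum, and the random calibration data are all assumptions made for illustration. The compensation techniques mentioned above are not modeled.

```python
import numpy as np

def calibrate_threshold(scores, k):
    """Estimate one static threshold for a head (hypothetical helper).

    scores: (num_rows, row_len) pre-softmax attention scores from calibration data.
    """
    # k-th largest score in every calibration row.
    kth_largest = np.partition(scores, -k, axis=-1)[:, -k]
    # Assumption: average the per-row cut-offs into a single threshold.
    return float(kth_largest.mean())

def top_theta_softmax(scores, theta):
    """Keep only scores >= theta, then apply softmax over the survivors."""
    keep = scores >= theta
    # Assumption: always keep the row maximum so that no row ends up empty.
    keep |= scores == scores.max(axis=-1, keepdims=True)
    masked = np.where(keep, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)  # exp(-inf) is exactly 0 for pruned elements
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights, keep

rng = np.random.default_rng(0)
calib_scores = rng.normal(size=(256, 128))  # synthetic calibration rows
theta = calibrate_threshold(calib_scores, k=16)

new_scores = rng.normal(size=(64, 128))     # unseen rows at inference time
weights, keep = top_theta_softmax(new_scores, theta)
avg_kept = keep.sum(axis=-1).mean()         # close to k on average, not exactly k
```

Unlike top-k, the comparison against `theta` needs no per-row sort or selection at inference time. The cost is that the number of retained elements varies from row to row and matches `k` only on average.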