Hybrid architectures combining full attention (FA) and sliding-window attention (SWA) are a promising paradigm for efficient LLM inference. However, existing methods typically rely on hand-crafted rules or simple post-hoc heuristics for FA/SWA allocation and offer limited analysis of the attention behaviors underlying these designs. We propose Controllable Sparsity in Hybrid Attention (ConSA), a framework that learns optimal FA/SWA assignment under a user-specified sparsity target. ConSA employs L0 regularization to learn binary masks selecting between FA and SWA for each attention unit, while an augmented Lagrangian constraint enforces the target sparsity at either layer or KV-head granularity. We evaluate ConSA on two LLMs at the 0.6B and 1.7B scales. Learned allocations consistently outperform rule-based baselines, with KV-head-wise allocation yielding clear gains over layer-wise allocation. The learned patterns place SWA in the bottom layers and concentrate FA into contiguous middle-layer blocks, diverging from evenly interleaved patterns in rule-based methods. This structure persists across model scales, sparsity levels, and allocation granularities, revealing a fine-grained spectrum of intrinsic attention behaviors that underlies the learned allocation.
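To make the mechanism concrete, the following is a minimal sketch of how an L0-style gate and a sparsity constraint could fit together. The abstract does not specify the gate parameterization or the exact penalty, so two assumptions are made here: gates follow the hard-concrete distribution commonly used for L0 regularization, and the constraint takes the linear-plus-quadratic form often seen in structured pruning. All function names, the constants `LEFT`, `RIGHT`, `TEMP`, and the multipliers `lam1` and `lam2` are hypothetical. Each attention unit (a layer or a KV head) carries one learnable `log_alpha`; an open gate stands for FA and a closed gate for SWA.

```python
import math

# Assumed hard-concrete stretch limits and temperature (common defaults
# in the L0 regularization literature, not values taken from the abstract).
LEFT, RIGHT, TEMP = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prob_open(log_alpha):
    """Probability that a hard-concrete gate is non-zero (unit keeps FA)."""
    return sigmoid(log_alpha - TEMP * math.log(-LEFT / RIGHT))


def deterministic_gate(log_alpha):
    """Inference-time gate: stretched sigmoid clipped to [0, 1]."""
    stretched = sigmoid(log_alpha) * (RIGHT - LEFT) + LEFT
    return min(1.0, max(0.0, stretched))


def expected_sparsity(log_alphas):
    """Expected fraction of units whose gate is closed (unit uses SWA)."""
    open_mass = sum(prob_open(a) for a in log_alphas)
    return 1.0 - open_mass / len(log_alphas)


def lagrangian_penalty(log_alphas, target, lam1, lam2):
    """Assumed penalty: lam1 * gap + lam2 * gap^2, gap = sparsity - target."""
    gap = expected_sparsity(log_alphas) - target
    return lam1 * gap + lam2 * gap * gap


if __name__ == "__main__":
    # Four hypothetical attention units: two lean toward FA, two toward SWA.
    log_alphas = [10.0, 10.0, -10.0, -10.0]
    print([deterministic_gate(a) for a in log_alphas])
    print(expected_sparsity(log_alphas))
    print(lagrangian_penalty(log_alphas, target=0.5, lam1=1.0, lam2=1.0))
```

In a training loop under these assumptions, the penalty would be added to the language-modeling loss, with `log_alphas` updated by gradient descent and the multipliers by gradient ascent, so that the expected sparsity is driven toward the user-specified target.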