The prefill stage in long-context LLM inference remains a computational bottleneck. Recent token-ranking heuristics accelerate inference by selectively processing a subset of semantically relevant tokens. However, existing methods suffer from unstable token-importance estimation, with estimates often varying across layers. Evaluating token-ranking quality independently of heuristic-specific architectures is challenging. To address this, we introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from the generated answer back to the prompt. This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks. This diagnosis suggests a simple fix: aggregate scores across layers rather than relying on any single one. We implement this as Cross-Layer Attention Aggregation (CLAA), which closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39\% relative to the Full KV Cache baseline.