Handling long-context sequences efficiently remains a significant challenge for large language models (LLMs). Existing token-selection methods for sequence extrapolation either employ a permanent eviction strategy or select tokens by chunk, both of which can discard critical information. We propose Efficient Selective Attention (ESA), a novel approach that extends context length by efficiently selecting the most critical tokens at the token level for attention computation. ESA reduces the computational complexity of token selection by compressing query and key vectors into lower-dimensional representations. We evaluate ESA on long-sequence benchmarks with maximum lengths up to 256k, using open-source LLMs with context lengths of 8k and 32k. ESA outperforms other selective attention methods, especially in tasks requiring the retrieval of multiple pieces of information, and achieves performance comparable to full-attention extrapolation methods across various tasks, surpassing them on some.
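The core mechanism, selecting tokens via scores computed in a compressed query/key space before running full attention over only the chosen tokens, can be illustrated with a minimal NumPy sketch. All dimensions and the random down-projection matrices below are illustrative stand-ins, not the paper's actual learned projections or hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; chosen only for illustration).
seq_len, d_model, d_low, top_k = 1024, 128, 16, 64

# One query step and a cache of keys at full dimension.
q = rng.standard_normal(d_model)
K = rng.standard_normal((seq_len, d_model))

# Stand-ins for learned down-projections into a low-dimensional space.
W_q = rng.standard_normal((d_model, d_low)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_low)) / np.sqrt(d_model)

# Score every cached token in the compressed space:
# O(seq_len * d_low) instead of O(seq_len * d_model).
q_low = q @ W_q                 # shape (d_low,)
K_low = K @ W_k                 # shape (seq_len, d_low)
scores = K_low @ q_low          # shape (seq_len,)

# Keep only the top-k highest-scoring token positions.
selected = np.argsort(scores)[-top_k:]

# Full-precision attention is then computed over the selected
# tokens only, using the original full-dimension keys.
attn_logits = K[selected] @ q / np.sqrt(d_model)
attn = np.exp(attn_logits - attn_logits.max())
attn /= attn.sum()

print(selected.shape, attn.shape)
```

The point of the compression step is that the expensive per-token scoring pass over the whole cache runs in the reduced dimension, while the subsequent softmax attention, now over only `top_k` tokens, retains full-dimensional precision.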