Sparse attention is an efficient way to significantly reduce computation cost, but current sparse attention methods tend to rely on window self-attention, which blocks the global flow of information. To address this problem, we present Shifted Cross Chunk Attention (SCCA), which uses different KV shifting strategies to extend the receptive field in each attention layer. In addition, we combine Dilated Attention (DA) and Dilated Neighborhood Attention (DNA) to present Shifted Dilated Attention (SDA). Both SCCA and SDA accumulate attention results in multi-head attention to approximate the receptive field of full attention. In this paper, we conduct language modeling experiments using different SCCA patterns and combinations of SCCA and SDA. Combined with Positional Interpolation (PI) and LoRA, the proposed SCCA extends large language models (LLMs) to longer contexts more effectively than current sparse attention methods. Notably, SCCA adapts LLaMA2 7B from a 4k context to 8k on a single V100. This attention pattern provides a plug-and-play fine-tuning method that extends a model's context while retaining its original architecture, and it is compatible with most existing techniques.
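To make the receptive-field argument concrete, the following is a minimal sketch rather than the authors' implementation. It only tracks which key positions each query can reach, without computing any attention weights. The abstract does not give the shift sizes or how heads are assigned, so the cyclic shift of half a chunk, the two-head setup, the non-causal mask, and all function names here are illustrative assumptions.

```python
def chunk_attention_mask(seq_len, chunk, shift=0):
    """Return mask[q][k] = True if query q may attend to key k.

    Assumption: keys/values are cyclically shifted by `shift` positions
    and then split into chunks; queries stay in place.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        start = (q // chunk) * chunk
        for slot in range(start, start + chunk):
            # Original key index that lands in this slot after the shift.
            k = (slot + shift) % seq_len
            mask[q][k] = True
    return mask


def union_masks(masks):
    """Combine per-head masks: a key is visible if any head can see it."""
    n = len(masks[0])
    return [[any(m[q][k] for m in masks) for k in range(n)] for q in range(n)]


def reachable(mask, q, layers):
    """Positions whose information can reach query q after `layers` layers."""
    seen = {q}
    for _ in range(layers):
        seen = {k for s in seen for k, ok in enumerate(mask[s]) if ok}
    return sorted(seen)


SEQ_LEN, CHUNK = 8, 4

# Plain chunked (window-style) attention: every head uses shift 0.
window_only = chunk_attention_mask(SEQ_LEN, CHUNK, shift=0)

# Hypothetical shifted setup: one head unshifted, one shifted by half a chunk.
combined = union_masks([
    chunk_attention_mask(SEQ_LEN, CHUNK, shift=0),
    chunk_attention_mask(SEQ_LEN, CHUNK, shift=CHUNK // 2),
])

print(reachable(window_only, 0, 2))  # stays inside the first chunk
print(reachable(combined, 0, 1))     # one layer already crosses the chunk border
print(reachable(combined, 0, 2))     # two layers cover the whole sequence
```

In this toy setting, the unshifted pattern never lets information leave its chunk no matter how many layers are stacked, while combining a shifted head with an unshifted one lets information cross chunk boundaries and spread over the full sequence within two layers.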