In this paper, we introduce a new code completion task that focuses on handling long code input, and we propose a sparse Transformer model, called LongCoder, to address it. LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens, bridge tokens and memory tokens, to improve performance and efficiency. Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction, while memory tokens mark important statements that may be invoked later and therefore need to be memorized, such as package imports and definitions of classes, functions, or structures. We conduct experiments on both a newly constructed dataset with longer code contexts and the publicly available CodeXGLUE benchmark. Experimental results demonstrate that LongCoder outperforms previous models on code completion tasks while requiring comparable computational resources during inference. All code and data are available at https://github.com/microsoft/CodeBERT.
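To make the attention pattern described in the abstract concrete, the following is a minimal sketch of a sparse attention mask that combines a sliding window with globally accessible positions. It is an illustration under stated assumptions, not the LongCoder implementation: the function name `build_sparse_mask`, its parameters, the causal (left-to-right) restriction, and the rule that every later position may attend to a bridge or memory position are all hypothetical choices made for this example.

```python
def build_sparse_mask(seq_len, window, bridge_positions, memory_positions):
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumptions (not taken from the paper): attention is causal, the window
    covers the `window` most recent positions including the current one, and
    bridge/memory positions are visible to every later position.
    """
    global_positions = set(bridge_positions) | set(memory_positions)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(i + 1):  # causal: only the current and earlier positions
            in_window = i - j < window
            if in_window or j in global_positions:
                mask[i][j] = True
    return mask


def count_attended(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


# Hypothetical toy layout: a memory token at position 0 (e.g. an import
# statement) and bridge tokens at positions 4 and 8.
mask = build_sparse_mask(seq_len=12, window=3,
                         bridge_positions=[4, 8], memory_positions=[0])
dense_causal_pairs = 12 * 13 // 2  # pairs allowed by full causal attention
sparse_pairs = count_attended(mask)
```

In this toy layout the last position can still reach the memory token at position 0 and both bridge tokens even though they lie outside its window, while ordinary distant positions are masked out. The number of allowed pairs grows roughly linearly with sequence length for a fixed window and a bounded number of global positions, which is the intuition behind the efficiency claim.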