As software projects rapidly evolve, software artifacts become more complex and the defects within them become harder to identify. Emerging Transformer-based approaches, though achieving remarkable performance, struggle with long code sequences because the cost of their self-attention mechanism scales quadratically with sequence length. This paper introduces SparseCoder, an approach that incorporates sparse attention and a learned token pruning (LTP) method (adapted from natural language processing) to address this limitation. Extensive experiments on a large-scale dataset for vulnerability detection demonstrate the effectiveness and efficiency of SparseCoder, which reduces scaling on long code sequences from quadratic to linear in comparison to CodeBERT and RoBERTa. We further achieve a 50% reduction in FLOPs, with a negligible performance drop of less than 1%, compared to a Transformer leveraging sparse attention. Moreover, SparseCoder goes beyond making "black-box" decisions by elucidating the rationale behind those decisions. Code segments that contribute to the final decision can be highlighted with importance scores, offering an interpretable, transparent analysis tool for the software engineering landscape.
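To make the scaling claim concrete, the following is a minimal sketch that counts attention score entries under dense self-attention and under a sliding-window pattern, and then applies a threshold to per-token importance scores. The sliding-window pattern, the window size, the threshold value, and all function names are assumptions chosen for illustration; the abstract does not specify SparseCoder's exact sparsity pattern, and in LTP the thresholds are learned rather than fixed by hand.

```python
from typing import List


def full_attention_pairs(n: int) -> int:
    """Query-key score entries computed by dense self-attention: n * n."""
    return n * n


def sliding_window_pairs(n: int, window: int) -> int:
    """Score entries when each token attends only to neighbours within
    `window` positions on either side (an assumed sparse pattern)."""
    total = 0
    for i in range(n):
        lo = max(0, i - window)
        hi = min(n - 1, i + window)
        total += hi - lo + 1
    return total


def prune_tokens(importance: List[float], threshold: float) -> List[int]:
    """Indices of tokens whose importance score reaches the threshold.
    The threshold here is a hand-picked constant, not a learned one."""
    return [i for i, score in enumerate(importance) if score >= threshold]


if __name__ == "__main__":
    for n in (1024, 2048, 4096):
        dense = full_attention_pairs(n)
        sparse = sliding_window_pairs(n, window=64)
        print(f"n={n}: dense={dense}, sliding-window={sparse}")

    # Hypothetical importance scores for a four-token sequence.
    print(prune_tokens([0.4, 0.05, 0.3, 0.25], threshold=0.1))
```

Doubling the sequence length quadruples the dense count but only roughly doubles the sliding-window count, which is the quadratic-versus-linear behaviour the abstract describes. The kept indices from `prune_tokens` also hint at how importance scores could be used to highlight the code segments that drive a decision.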