Large Language Models (LLMs) are transforming software engineering tasks, including code vulnerability detection, a critical area of software security. However, existing methods often rely on resource-intensive models or graph-based techniques, which limits their accessibility and practicality. This paper introduces K-ASTRO, a lightweight Transformer model that combines semantic embeddings from LLMs with structural features of Abstract Syntax Trees (ASTs) to improve both efficiency and accuracy in code vulnerability detection. Our approach comprises an AST-based augmentation technique inspired by mutation testing, a structure-aware attention mechanism that incorporates augmented AST features, and a joint adaptation pipeline that unifies code semantics and syntax. Experimental results on three large-scale datasets (BigVul, DiverseVul, and PrimeVul) demonstrate state-of-the-art performance while enabling rapid inference on CPUs with minimal training time. By offering a scalable, interpretable, and efficient solution, K-ASTRO bridges the gap between LLM advancements and practical software vulnerability detection, and we release open-source tools to foster further research.
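To illustrate what an AST-based augmentation inspired by mutation testing could look like, the following is a minimal sketch and not K-ASTRO's actual implementation. It makes several assumptions that the abstract does not confirm: it operates on Python source through the standard `ast` module (the evaluated datasets contain C/C++ code), it uses a single hypothetical mutation operator that swaps a comparison for its boundary variant, and the names `SWAPS`, `CompareMutator`, and `mutate` are invented for this example.

```python
import ast

# Hypothetical mutation table (an assumption, not taken from the paper):
# each strict/non-strict comparison maps to its boundary variant.
SWAPS = {ast.Lt: ast.LtE, ast.LtE: ast.Lt, ast.Gt: ast.GtE, ast.GtE: ast.Gt}


class CompareMutator(ast.NodeTransformer):
    """Rewrite comparison operators in place; other operators are kept."""

    def visit_Compare(self, node):
        self.generic_visit(node)  # mutate nested comparisons first
        node.ops = [SWAPS.get(type(op), type(op))() for op in node.ops]
        return node


def mutate(source: str) -> str:
    """Parse source, apply the mutation operator, and return new source."""
    tree = CompareMutator().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


if __name__ == "__main__":
    original = "if i < n:\n    x = buf[i]"
    print(mutate(original))
```

Under these assumptions, each original snippet yields a structurally similar variant whose AST differs in a small, targeted way, which is the kind of paired sample a structure-aware model could be trained on.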