Malicious behavior is often hidden in small, easily overlooked code fragments, especially within large and complex codebases. The cross-file dependencies of these fragments make it difficult for even powerful large language models (LLMs) to detect them reliably. We propose a graph-centric attention acquisition pipeline that enhances LLMs' ability to localize malicious behavior. The approach parses a project into a code graph, uses an LLM to encode nodes with semantic and structural signals, and trains a Graph Neural Network (GNN) under sparse supervision. The GNN performs an initial detection and, by backtracking its predictions, identifies the code regions most likely to contain malicious behavior. These influential regions are then used to guide the LLM's attention for in-depth analysis. This strategy significantly reduces interference from irrelevant context while keeping annotation costs low. Extensive experiments show that our method consistently outperforms existing approaches on multiple public and self-constructed datasets, highlighting its potential for practical deployment in software security scenarios.
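The pipeline described above can be sketched in miniature. The snippet below is a hypothetical, simplified illustration only: a project's cross-file dependencies form a graph, a sparsely labelled seed plays the role of supervision, a toy score-propagation loop stands in for the trained GNN, and backtracking reduces to selecting the top-k suspicious nodes whose code would then be handed to the LLM for focused analysis. All node names and the propagation rule are invented for illustration, not taken from the paper.

```python
# Hypothetical sketch of the graph-centric attention pipeline.
# Nodes are code fragments (file:function); edges are cross-file dependencies.
from collections import defaultdict

def build_code_graph(edges):
    """Adjacency list from (caller, callee) dependency pairs."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
        graph[dst].add(src)  # treat dependencies as undirected for propagation
    return graph

def propagate_scores(graph, seed_scores, rounds=2, decay=0.5):
    """Toy stand-in for GNN message passing: each round, a node's suspicion
    score becomes max(own score, decayed max of its neighbours' scores)."""
    scores = dict(seed_scores)
    for _ in range(rounds):
        updated = dict(scores)
        for node, neighbours in graph.items():
            spread = max((scores.get(n, 0.0) for n in neighbours), default=0.0)
            updated[node] = max(scores.get(node, 0.0), decay * spread)
        scores = updated
    return scores

def top_k_regions(scores, k=2):
    """'Backtracking' reduced to its simplest form: pick the k most suspicious
    nodes, whose source would be surfaced to the LLM for in-depth analysis."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Invented example project: three cross-file dependency edges.
edges = [("main.py:main", "utils.py:fetch"),
         ("utils.py:fetch", "net.py:beacon"),
         ("main.py:main", "ui.py:render")]
graph = build_code_graph(edges)
seeds = {"net.py:beacon": 1.0}  # sparse supervision: a single labelled node
scores = propagate_scores(graph, seeds)
print(top_k_regions(scores))  # → ['net.py:beacon', 'utils.py:fetch']
```

The point of the sketch is the division of labour: the cheap graph-side scoring narrows a large codebase down to a handful of regions, so the expensive LLM pass reads only those fragments instead of the whole project.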