Large Language Models (LLMs) have significantly advanced code analysis tasks, yet they struggle to detect malicious behaviors fragmented across files, whose intricate dependencies are easily lost in the large volume of benign code. We therefore propose a graph-centric attention acquisition pipeline that enhances LLMs' ability to localize malicious behavior. The approach parses a project into a code graph, uses an LLM to encode nodes with semantic and structural signals, and trains a Graph Neural Network (GNN) under sparse supervision. The GNN performs an initial detection and, by interpreting these predictions, identifies the code sections most likely to contain malicious behavior. These influential regions are then used to guide the LLM's attention for in-depth analysis. This strategy significantly reduces interference from irrelevant context while keeping annotation costs low. Extensive experiments show that the method consistently outperforms existing approaches on multiple public and custom datasets, highlighting its potential for practical deployment in software security scenarios.
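The pipeline above can be illustrated with a minimal, self-contained sketch. All names here (`gnn_layer`, `top_k`, the file-node graph, and the scalar "embeddings") are hypothetical stand-ins: a real system would use high-dimensional LLM encodings and a trained GNN rather than the untrained one-layer mean-aggregation shown.

```python
from math import exp

# Toy "code graph": file node -> neighbours (e.g. call/import edges).
edges = {
    "main.py":  ["utils.py", "net.py"],
    "utils.py": ["main.py"],
    "net.py":   ["main.py", "exfil.py"],
    "exfil.py": ["net.py"],
}

# Stand-in for LLM node encodings: one scalar feature per node
# (a real pipeline would use semantic embedding vectors).
feats = {"main.py": 0.1, "utils.py": 0.0, "net.py": 0.6, "exfil.py": 0.9}

def gnn_layer(feats, edges, w_self=0.5, w_neigh=0.5):
    """One mean-aggregation message-passing step with fixed toy weights."""
    out = {}
    for node, nbrs in edges.items():
        neigh_mean = sum(feats[n] for n in nbrs) / len(nbrs)
        out[node] = w_self * feats[node] + w_neigh * neigh_mean
    return out

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# Initial detection: propagate once over the graph, then score each node.
hidden = gnn_layer(feats, edges)
scores = {node: sigmoid(v) for node, v in hidden.items()}

def top_k(scores, k=2):
    """Interpret the predictions: pick the k highest-scoring regions
    to surface to the LLM for focused, in-depth analysis."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

focus = top_k(scores, k=2)
print(focus)  # highest-scoring nodes guide the LLM's attention
```

In this toy run, the propagation step boosts `net.py` because it neighbors the suspicious `exfil.py`, so both surface as the regions the LLM should examine, which is the intuition behind letting the GNN's interpretation narrow the LLM's context.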