Large Language Models (LLMs) have significantly advanced code analysis tasks, yet they struggle to detect malicious behaviors that are fragmented across files, because the intricate dependencies between the fragments are easily lost in the vast amount of benign code. We therefore propose a graph-centric attention acquisition pipeline that enhances LLMs' ability to localize malicious behavior. The approach parses a project into a code graph, uses an LLM to encode nodes with semantic and structural signals, and trains a Graph Neural Network (GNN) under sparse supervision. The GNN performs an initial detection, and interpreting its predictions identifies the code sections most likely to contain malicious behavior. These influential regions are then used to guide the LLM's attention for in-depth analysis. This strategy significantly reduces interference from irrelevant context while keeping annotation costs low. Extensive experiments show that the method consistently outperforms existing approaches on multiple public and custom datasets, highlighting its potential for practical deployment in software security scenarios.
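The region-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `CodeNode` schema, the helper names, the toy project, and the hard-coded `score` values are all assumptions made for illustration. In the described pipeline the scores would come from interpreting a trained GNN's predictions, whereas here they are fixed placeholders. The sketch only shows how per-node importance scores on a code graph could be turned into a reduced context for the LLM.

```python
from dataclasses import dataclass, field


@dataclass
class CodeNode:
    """One node of the code graph, e.g. a function (hypothetical schema)."""
    name: str
    source: str
    score: float = 0.0  # placeholder for a GNN/explainer importance score
    neighbors: set = field(default_factory=set)


def add_edge(graph, a, b):
    """Record an undirected dependency (call, import, ...) between two nodes."""
    graph[a].neighbors.add(b)
    graph[b].neighbors.add(a)


def select_regions(graph, k=1, hops=1):
    """Pick the k highest-scoring nodes, then expand `hops` steps along edges."""
    ranked = sorted(graph.values(), key=lambda n: n.score, reverse=True)
    selected = {n.name for n in ranked[:k]}
    frontier = set(selected)
    for _ in range(hops):
        # Neighbors of the current frontier that are not selected yet.
        frontier = {nb for name in frontier for nb in graph[name].neighbors} - selected
        selected |= frontier
    return selected


def build_prompt(graph, selected):
    """Concatenate only the selected snippets, highest score first."""
    nodes = sorted((graph[n] for n in selected), key=lambda n: (-n.score, n.name))
    parts = [f"# {n.name} (score={n.score:.2f})\n{n.source}" for n in nodes]
    return "Review the following code for malicious behavior:\n\n" + "\n\n".join(parts)


# Toy project: the suspicious logic is split across two files.
graph = {
    "utils.decode": CodeNode("utils.decode", "def decode(s): return base64.b64decode(s)", 0.35),
    "setup.run": CodeNode("setup.run", "def run(): exec(decode(PAYLOAD))", 0.91),
    "cli.main": CodeNode("cli.main", "def main(): print('hello')", 0.02),
    "docs.build": CodeNode("docs.build", "def build(): render('index.md')", 0.01),
}
add_edge(graph, "setup.run", "utils.decode")
add_edge(graph, "cli.main", "docs.build")

selected = select_regions(graph, k=1, hops=1)
prompt = build_prompt(graph, selected)
print(sorted(selected))  # ['setup.run', 'utils.decode']
```

Expanding the top-scoring node by one hop pulls in the helper it depends on, so the two fragments of the behavior reach the LLM together, while the unrelated benign nodes stay out of the prompt.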