Taint-Based Code Slicing for LLMs-based Malicious NPM Package Detection

Software supply chain attacks on the npm ecosystem have grown increasingly sophisticated, exploiting obfuscation and complex logic to evade detection. Large Language Models (LLMs) offer strong semantic understanding of code but face practical constraints: limited context windows and high inference costs make full-package analysis infeasible, while naive token-based splitting fragments semantic context and degrades accuracy. This paper introduces an LLM-based framework for malicious npm package detection built on code-slicing techniques. We propose an adaptation of taint-based slicing for the npm ecosystem, guided by a curated inventory of JavaScript-specific sensitive APIs, to isolate security-relevant data flows from benign boilerplate. The approach reduces the mean input token count by 99.75% and the median by 93.7% while preserving critical malicious behaviors. Packages relying on dynamic code generation or obfuscation yield empty slices under static analysis and require deobfuscation preprocessing, a limitation we explicitly discuss. The framework is evaluated on a dataset of more than 7000 malicious and benign npm packages using DeepSeek-Coder 6.7B. On the 2537 packages amenable to static taint analysis, taint-based slicing achieves 87.04% detection accuracy, outperforming both a naive token-splitting baseline (75.41%) and a CFG-only static slicing approach (75.65%). These results indicate that semantically targeted input representations improve LLM-based detection beyond what simple input-size reduction achieves, providing an effective and computationally practical defense against evolving open-source supply chain threats.
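To make the idea of taint-based slicing concrete, the following is a minimal sketch in Python and not the paper's implementation. It assumes a straight-line program whose statements have already been annotated with the variable they define, the variables they use, and the API they call. The `Stmt` type, the `taint_slice` function, and the small `SOURCES` and `SINKS` sets are all hypothetical stand-ins for a real JavaScript analysis and a curated sensitive-API inventory.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# Hypothetical, tiny stand-in for a curated inventory of sensitive JavaScript APIs.
SOURCES = {"process.env", "os.homedir", "fs.readFileSync"}      # where sensitive data originates
SINKS = {"https.request", "child_process.exec", "eval"}         # where it could do harm


@dataclass
class Stmt:
    """One pre-analysed statement of a straight-line program (illustrative only)."""
    line: int
    text: str
    defines: Optional[str] = None          # variable assigned by this statement, if any
    uses: FrozenSet[str] = frozenset()     # variables read by this statement
    calls: Optional[str] = None            # API invoked by this statement, if any


def taint_slice(stmts: List[Stmt]) -> List[Stmt]:
    """Keep only statements on a data-flow path from a source to a sink."""
    # Forward pass: a variable is tainted if it is assigned from a source call
    # or from an expression that reads an already-tainted variable.
    tainted = set()
    for s in stmts:
        if s.defines and (s.calls in SOURCES or s.uses & tainted):
            tainted.add(s.defines)

    # Backward pass: start from sinks that consume tainted data and follow
    # the tainted definitions they depend on.
    keep, needed = set(), set()
    for s in reversed(stmts):
        if s.calls in SINKS and s.uses & tainted:
            keep.add(s.line)
            needed |= s.uses & tainted
        elif s.defines in needed:
            keep.add(s.line)
            needed.discard(s.defines)
            needed |= s.uses & tainted
    return [s for s in stmts if s.line in keep]


program = [
    Stmt(1, "const os = require('os');"),
    Stmt(2, "const banner = 'hello';", defines="banner"),
    Stmt(3, "const home = os.homedir();", defines="home", calls="os.homedir"),
    Stmt(4, "const payload = home + '/.npmrc';", defines="payload", uses=frozenset({"home"})),
    Stmt(5, "console.log(banner);", uses=frozenset({"banner"}), calls="console.log"),
    Stmt(6, "https.request(url, payload);", uses=frozenset({"url", "payload"}), calls="https.request"),
]

for stmt in taint_slice(program):
    print(stmt.line, stmt.text)
```

In this toy program only statements 3, 4, and 6 survive, since they carry data from `os.homedir` to `https.request`; the banner and logging lines are dropped as boilerplate. A program with no source-to-sink flow produces an empty slice under this sketch.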
