Modern software package registries like PyPI have become critical infrastructure for software development, but are increasingly exploited by threat actors distributing malicious packages with sophisticated multi-stage attack chains. While Large Language Models (LLMs) offer promising capabilities for automated code analysis, their application to security-critical malware detection faces fundamental challenges, including hallucination and context confusion, which can lead to missed detections or false alarms. We present CHASE (Collaborative Hierarchical Agents for Security Exploration), a high-reliability multi-agent architecture that addresses these limitations through a Plan-and-Execute coordination model, specialized Worker Agents focused on specific analysis aspects, and integration with deterministic security tools for critical operations. Our key insight is that reliability in LLM-based security analysis emerges not from improving individual model capabilities but from architecting systems that compensate for LLM weaknesses while leveraging their semantic understanding strengths. Evaluation on a dataset of 3,000 packages (500 malicious, 2,500 benign) demonstrates that CHASE achieves 98.4% recall with only 0.08% false positive rate, while maintaining a practical median analysis time of 4.5 minutes per package, making it suitable for operational deployment in automated package screening. Furthermore, we conducted a survey with cybersecurity professionals to evaluate the generated analysis reports, identifying their key strengths and areas for improvement. This work provides a blueprint for building reliable AI-powered security tools that can scale with the growing complexity of modern software supply chains. Our project page is available at https://t0d4.github.io/CHASE-AIware25/