Modern software package registries like PyPI have become critical infrastructure for software development, but are increasingly exploited by threat actors distributing malicious packages with sophisticated multi-stage attack chains. While Large Language Models (LLMs) offer promising capabilities for automated code analysis, their application to security-critical malware detection faces fundamental challenges, including hallucination and context confusion, which can lead to missed detections or false alarms. We present CHASE (Collaborative Hierarchical Agents for Security Exploration), a high-reliability multi-agent architecture that addresses these limitations through a Plan-and-Execute coordination model, specialized Worker Agents focused on specific analysis aspects, and integration with deterministic security tools for critical operations. Our key insight is that reliability in LLM-based security analysis emerges not from improving individual model capabilities but from architecting systems that compensate for LLM weaknesses while leveraging their semantic understanding strengths. Evaluation on a dataset of 3,000 packages (500 malicious, 2,500 benign) demonstrates that CHASE achieves 98.4% recall with only 0.08% false positive rate, while maintaining a practical median analysis time of 4.5 minutes per package, making it suitable for operational deployment in automated package screening. Furthermore, we conducted a survey with cybersecurity professionals to evaluate the generated analysis reports, identifying their key strengths and areas for improvement. This work provides a blueprint for building reliable AI-powered security tools that can scale with the growing complexity of modern software supply chains. Our project page is available at https://t0d4.github.io/CHASE-AIware25/