Red-MIRROR: Agentic LLM-based Autonomous Penetration Testing with Reflective Verification and Knowledge-augmented Interaction

Web applications remain the dominant attack surface in cybersecurity, where vulnerabilities such as SQL injection, XSS, and business logic flaws continue to cause significant data breaches. While penetration testing is effective for identifying these weaknesses, traditional manual approaches are time-consuming and depend heavily on scarce expert knowledge. Recent Large Language Model (LLM)-based multi-agent systems have shown promise in automating penetration testing, yet they still suffer from three critical limitations: over-reliance on parametric knowledge, fragmented session memory, and insufficient validation of attack payloads and responses. This paper proposes Red-MIRROR, a novel multi-agent automated penetration testing system that introduces a tightly coupled memory-reflection backbone to explicitly govern inter-agent reasoning. Red-MIRROR combines Retrieval-Augmented Generation (RAG) for external knowledge augmentation, a Shared Recurrent Memory Mechanism (SRMM) for persistent state management, and a Dual-Phase Reflection mechanism for adaptive validation, targeting complex web exploitation. Empirical evaluation on the XBOW benchmark and Vulhub CVEs shows that Red-MIRROR performs comparably to state-of-the-art agents on Vulhub scenarios while demonstrating a clear advantage on XBOW. On the XBOW benchmark, Red-MIRROR attains an overall success rate of 86.0 percent, outperforming PentestAgent (50.0 percent), AutoPT (46.0 percent), and the VulnBot baseline (6.0 percent). Furthermore, the system achieves a 93.99 percent subtask completion rate, indicating strong long-horizon reasoning and payload refinement capability. Finally, we discuss ethical implications and propose safeguards to mitigate misuse risks.
