Autonomous AI agents built on large language models can create undeniable value across many spheres of society, yet they face adversarial security threats that warrant immediate protective solutions as trust and safety concerns arise. Advanced attacks such as many-shot jailbreaking and deceptive alignment cannot be mitigated by the static guardrails applied during supervised training, which points to a crucial research priority for real-world robustness; combinations of static guardrails likewise fail to defend dynamic multi-agent systems against these attacks. We aim to enhance the security of LLM-based agents by developing new evaluation frameworks that identify and counter threats for safe operational deployment. Our work applies three examination methods: detecting rogue agents through a Reverse Turing Test, analyzing deceptive alignment through multi-agent simulations, and developing an anti-jailbreaking system evaluated on GEMINI 1.5 pro, llama-3.3-70B, and deepseek r1 models in tool-mediated adversarial scenarios. Detection capabilities are strong (94\% accuracy for GEMINI 1.5 pro), yet the system remains persistently vulnerable under prolonged attacks: attack success rates (ASR) rise with prompt length, diversity metrics lose predictive power, and multiple complex system faults are exposed. These findings demonstrate the need for flexible security systems based on active monitoring that the agents themselves can perform, combined with adaptive interventions by system administrators, since current models can introduce vulnerabilities that render the overall system unreliable. We therefore address these scenarios and propose a comprehensive framework to counter the identified security issues.