Autonomous AI agents built on large language models (LLMs) can create undeniable value across society, but they face adversarial security threats that raise trust and safety concerns and warrant immediate protective solutions. Advanced attacks such as many-shot jailbreaking and deceptive alignment cannot be mitigated by the static guardrails instilled during supervised training, which makes real-world robustness a crucial research priority. Even in combination, static guardrails fail to defend dynamic multi-agent systems against these attacks. We aim to enhance the security of LLM-based agents by developing new evaluation frameworks that identify and counter threats to safe operational deployment. Our work applies three evaluation methods: detecting rogue agents with a Reverse Turing Test, analyzing deceptive alignment through multi-agent simulations, and stress-testing an anti-jailbreaking system on Gemini 1.5 Pro, Llama-3.3-70B, and DeepSeek R1 in tool-mediated adversarial scenarios. Detection capabilities are strong (e.g., 94\% accuracy for Gemini 1.5 Pro), yet persistent vulnerabilities remain under long attacks: attack success rates (ASR) rise with prompt length, diversity metrics lose predictive power, and multiple complex system faults are exposed. These findings demonstrate the need for flexible security systems that combine active monitoring performed by the agents themselves with adaptive interventions by system administrators, since current models can introduce vulnerabilities that render the overall system unreliable. We therefore address these situations and propose a comprehensive framework to counteract the identified security issues.
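To make the headline metric concrete, the following is a minimal sketch of how ASR could be tabulated per prompt-length bucket to surface the length-dependent weakness described above; the function names and example trial records (e.g., `attack_success_rate`, `prompt_tokens`) are hypothetical illustrations, not the paper's evaluation harness.

```python
from collections import defaultdict

def attack_success_rate(trials):
    """Fraction of adversarial trials whose jailbreak attempt succeeded."""
    return sum(t["success"] for t in trials) / len(trials)

def asr_by_prompt_length(trials, bucket=1000):
    """Group trials into prompt-length buckets (in tokens) and report
    ASR per bucket, exposing how success rates rise with prompt length."""
    buckets = defaultdict(list)
    for t in trials:
        buckets[(t["prompt_tokens"] // bucket) * bucket].append(t)
    return {lo: attack_success_rate(ts) for lo, ts in sorted(buckets.items())}

# Hypothetical trial log: each record stores prompt length and outcome.
trials = [
    {"prompt_tokens": 300,  "success": False},
    {"prompt_tokens": 1200, "success": False},
    {"prompt_tokens": 2400, "success": True},
    {"prompt_tokens": 2600, "success": True},
]
print(asr_by_prompt_length(trials))  # {0: 0.0, 1000: 0.0, 2000: 1.0}
```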