The Intelligence Advanced Research Projects Activity (IARPA) launched the TrojAI program to confront an emerging vulnerability in modern artificial intelligence: the threat of AI Trojans. AI Trojans are malicious, hidden backdoors intentionally embedded within an AI model that can cause a system to fail in unexpected ways or allow a malicious actor to hijack the model at will. This multi-year initiative mapped the complex nature of the threat, pioneered foundational detection methods, and identified unsolved challenges that require ongoing attention from the burgeoning AI security field. This report synthesizes the program's key findings, including methodologies for detection through weight analysis and trigger inversion, as well as approaches for mitigating Trojan risks in deployed models. Comprehensive test and evaluation results highlight detector performance, sensitivity, and the prevalence of "natural" Trojans. The report concludes with lessons learned and recommendations for advancing AI security research.
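To make the trigger-inversion idea mentioned above concrete, the following is a minimal, hypothetical sketch on a toy linear classifier, not any detector from the TrojAI program itself. All names (`invert_trigger`, the planted backdoor feature, the penalty weight) are illustrative assumptions: the detector gradient-descends a sparse input perturbation toward a target class, and a small-norm perturbation that flips most clean inputs is evidence of a backdoor.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def invert_trigger(w, b, X, target=1.0, lam=0.01, lr=0.1, steps=300):
    """Gradient-descend a sparse input perturbation (a candidate trigger)
    that pushes the linear model's output toward the target class.
    lam weights an L1 penalty that keeps the recovered trigger small."""
    delta = np.zeros_like(w)
    for _ in range(steps):
        p = sigmoid((X + delta) @ w + b)
        # Gradient of cross-entropy toward `target`, plus L1 sparsity term.
        grad = np.mean(p - target) * w + lam * np.sign(delta)
        delta -= lr * grad
    return delta

# Toy "backdoored" classifier: feature 3 carries a planted trigger weight.
d = 10
w = rng.normal(0.0, 0.1, size=d)
w[3] = 4.0           # backdoor direction (illustrative plant)
b = -3.0             # clean inputs land firmly in class 0
X = rng.uniform(0.0, 1.0, size=(200, d))
X[:, 3] = 0.0        # the trigger feature is absent in clean data

delta = invert_trigger(w, b, X)
flip_rate = float(np.mean(sigmoid((X + delta) @ w + b) > 0.5))
suspect_feature = int(np.argmax(np.abs(delta)))
# A small perturbation that flips nearly all clean inputs flags the model;
# its dominant component points at the planted backdoor feature.
print(flip_rate, suspect_feature)
```

The design choice to show here is the sparsity penalty: without it, any sufficiently large perturbation flips a classifier, so it is the combination of a *small* recovered trigger and a *high* flip rate that separates backdoored models from clean ones.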