AI systems sometimes exhibit harmful unintended behaviors post-deployment, often despite extensive diagnostics and debugging by developers. Minimizing risks from models is challenging because the attack surface is so large: it is not tractable to exhaustively search for inputs that may cause a model to fail. Red-teaming and adversarial training (AT) are commonly used to make AI systems more robust. However, they have not been sufficient to avoid many real-world failure modes that differ from those seen during adversarial training. In this work, we use latent adversarial training (LAT) to defend against vulnerabilities without generating inputs that elicit them. LAT leverages the compressed, abstract, and structured latent representations of concepts that the network actually uses for prediction. We use LAT to remove trojans and defend against held-out classes of adversarial attacks. We show in image classification, text classification, and text generation tasks that LAT usually improves both robustness and performance on clean data relative to AT. This suggests that LAT can be a promising tool for defending against failure modes that are not explicitly identified by developers.
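The idea of attacking latent representations rather than inputs can be illustrated with a minimal sketch. The abstract does not specify the attack procedure, norm, or architecture, so everything below is an assumption made for illustration: a tiny two-layer network with a tanh hidden layer, a logistic loss, and a single gradient step that perturbs the hidden activations within an L2 ball of radius `eps`. The names `forward_latent`, `latent_attack`, and `lat_step` are hypothetical and are not taken from the paper.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward_latent(W1, x):
    # Latent representation: h = tanh(W1 x).
    return [math.tanh(dot(row, x)) for row in W1]


def loss_fn(w2, h, y):
    # Logistic loss of the linear head w2 applied to latent h.
    p = sigmoid(dot(w2, h))
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def latent_attack(w2, h, y, eps):
    # Assumption: one gradient-ascent step on the loss with respect to h,
    # rescaled to L2 norm eps. No input x is ever constructed.
    p = sigmoid(dot(w2, h))
    grad = [(p - y) * w for w in w2]  # dLoss/dh for the logistic loss
    norm = math.sqrt(dot(grad, grad))
    if norm == 0.0:
        return [0.0] * len(h)
    return [eps * g / norm for g in grad]


def lat_step(W1, w2, x, y, eps=0.3, lr=0.1):
    # One training step on the latent-perturbed loss (updates W1, w2 in place).
    h = forward_latent(W1, x)
    delta = latent_attack(w2, h, y, eps)  # treated as a constant
    h_adv = [hi + di for hi, di in zip(h, delta)]
    adv_loss = loss_fn(w2, h_adv, y)
    err = sigmoid(dot(w2, h_adv)) - y
    grad_w2 = [err * ha for ha in h_adv]
    grad_h = [err * w for w in w2]
    for i, row in enumerate(W1):
        g_pre = grad_h[i] * (1.0 - h[i] ** 2)  # backprop through tanh
        for j in range(len(row)):
            row[j] -= lr * g_pre * x[j]
    for i in range(len(w2)):
        w2[i] -= lr * grad_w2[i]
    return adv_loss  # loss measured before the update


if __name__ == "__main__":
    random.seed(0)
    W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
    w2 = [random.uniform(-1, 1) for _ in range(4)]
    for step in range(5):
        print(step, lat_step(W1, w2, [0.5, -1.0, 2.0], 1.0))
```

By contrast, standard AT would compute the perturbation on `x` itself. A multi-step attack or a different norm constraint could be swapped in for `latent_attack` without changing the structure of `lat_step`.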