AI systems sometimes exhibit harmful unintended behaviors after deployment, often despite extensive diagnostics and debugging by developers. Minimizing risks from models is challenging because the attack surface is so large: it is not tractable to exhaustively search for inputs that may cause a model to fail. Red-teaming and adversarial training (AT) are commonly used to make AI systems more robust. However, they have not been sufficient to avoid many real-world failure modes that differ from those the model was adversarially trained on. In this work, we use latent adversarial training (LAT) to defend against vulnerabilities without generating inputs that elicit them. LAT leverages the compressed, abstract, and structured latent representations of concepts that the network actually uses for prediction. We use LAT to remove trojans and defend against held-out classes of adversarial attacks. We show in image classification, text classification, and text generation tasks that LAT usually improves both robustness and performance on clean data relative to AT. This suggests that LAT can be a promising tool for defending against failure modes that are not explicitly identified by developers.
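The abstract does not spell out the LAT procedure, so the following is only a minimal sketch of the general idea of perturbing a latent representation rather than an input. It assumes a toy two-layer binary classifier, a single signed-gradient step, and an L-infinity budget `eps`. The function names (`encode`, `head_loss`, `latent_attack`) and the weights are hypothetical and are not taken from the authors' implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def encode(x, W1):
    """Toy latent layer (assumption): h = tanh(W1 x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def head_loss(h, w2, y):
    """Binary cross-entropy of a linear + sigmoid head applied to latent h."""
    p = sigmoid(sum(w * hi for w, hi in zip(w2, h)))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def latent_attack(h, w2, y, eps):
    """One signed-gradient ascent step on the latent, within an L-inf ball of radius eps."""
    p = sigmoid(sum(w * hi for w, hi in zip(w2, h)))
    # For a sigmoid + cross-entropy head, d(loss)/d(h_i) = (p - y) * w2_i.
    grad = [(p - y) * w for w in w2]
    sign = lambda g: (g > 0) - (g < 0)
    return [hi + eps * sign(g) for hi, g in zip(h, grad)]

# Illustrative (made-up) weights and a single labeled example.
W1 = [[0.5, -0.3], [0.1, 0.8]]
w2 = [1.2, -0.7]
x, y = [1.0, 2.0], 1

h = encode(x, W1)                         # clean latent
h_adv = latent_attack(h, w2, y, eps=0.1)  # adversarially perturbed latent
clean_loss = head_loss(h, w2, y)
adv_loss = head_loss(h_adv, w2, y)        # a training step would minimize this w.r.t. the weights
```

In this sketch, no adversarial input is ever constructed: the perturbation is applied directly to `h`, and a training loop would then update the model parameters to reduce `adv_loss`. A multi-step attack or a different norm constraint could be substituted without changing the overall structure.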