Reliable, robust, and safe AI systems need fallback strategies for cases in which a prediction cannot be trusted. Certifiers for neural networks are a reliable way to check the robustness of predictions: for some predictions, they guarantee that a certain class of manipulations or attacks could not have changed the outcome. For the remaining predictions, which come without guarantees, the method abstains, and a fallback strategy must be invoked. The fallback typically incurs additional costs, may require a human operator, or may even fail to provide any prediction. While this is a key concept for safe and secure AI, we show for the first time that the approach comes with its own security risks, because an adversary can deliberately trigger such fallback strategies. Beyond the abstentions that occur naturally for some inputs and perturbations, the adversary can use training-time attacks to trigger the fallback with high probability. This shifts the main system load onto the fallback and reduces the overall system's integrity and/or availability. We design two novel availability attacks that demonstrate the practical relevance of these threats. For example, adding 1% poisoned data during training is sufficient to trigger the fallback, so that inserting the trigger makes the model unavailable for up to 100% of all inputs. Our extensive experiments across multiple datasets, model architectures, and certifiers demonstrate the broad applicability of these attacks. An initial investigation into potential defenses shows that current approaches are insufficient to mitigate the issue, highlighting the need for new, specific solutions.
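To make the certify-or-abstain pipeline concrete, the following is a minimal sketch and not one of the certifiers evaluated in this work. It assumes a toy linear classifier, for which an exact L2 robustness check is available, and the names `certify_linear`, `predict_with_fallback`, and the `fallback` callable are hypothetical. A prediction is returned only when no perturbation within the given radius can change the top class; otherwise the system abstains and hands the input to the fallback.

```python
import math


def certify_linear(weights, biases, x, radius):
    """Return the predicted class if it is provably unchanged for every
    perturbation of x with L2 norm <= radius; return None to abstain.

    Assumption: the model is a plain linear classifier (one weight row
    and one bias per class), chosen only because its certificate is exact.
    """
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    top = max(range(len(scores)), key=scores.__getitem__)
    for j in range(len(scores)):
        if j == top:
            continue
        diff = [a - c for a, c in zip(weights[top], weights[j])]
        norm = math.sqrt(sum(d * d for d in diff))
        margin = scores[top] - scores[j]
        # A worst-case perturbation lowers the margin by radius * norm,
        # so the prediction is only certified if the margin stays positive.
        if margin <= radius * norm:
            return None
    return top


def predict_with_fallback(weights, biases, x, radius, fallback):
    """Return (label, used_fallback). The fallback is any callable that
    stands in for a human operator or a more expensive system."""
    label = certify_linear(weights, biases, x, radius)
    if label is None:
        return fallback(x), True
    return label, False


# Hypothetical two-class model and inputs, for illustration only.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [0.0, 0.0]
confident = predict_with_fallback(W, B, [2.0, 0.0], 0.5, lambda x: -1)
borderline = predict_with_fallback(W, B, [0.5, 0.4], 0.5, lambda x: -1)
```

In this sketch the first input is certified, while the second lies too close to the decision boundary and is routed to the fallback. The fraction of inputs for which `used_fallback` is true is the load that the availability attacks described above aim to drive up.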