In this work, we highlight and perform a comprehensive study of calibration attacks, a form of adversarial attack that aims to trick victim models into becoming heavily miscalibrated without altering their predicted labels, thereby endangering the trustworthiness of the models and any follow-up decision making based on their confidence. We propose four typical forms of calibration attacks: underconfidence, overconfidence, maximum-miscalibration, and random-confidence attacks, conducted in both black-box and white-box setups. We demonstrate that the attacks are highly effective on both convolutional and attention-based models: with a small number of queries, they severely skew confidence without changing predictive performance. Given the potential danger, we further investigate the effectiveness of a wide range of adversarial defence and recalibration methods, including defences we design specifically for calibration attacks, to mitigate the harm. Based on the ECE and KS scores, we observe that significant limitations remain in handling calibration attacks. To the best of our knowledge, this is the first dedicated study that provides a comprehensive investigation of calibration-focused attacks. We hope this study attracts more attention to these types of attacks and thereby helps avert their potentially serious damage. To this end, this work also provides detailed analyses to understand the characteristics of the attacks. Our code is available at https://github.com/PhenetOs/CalibrationAttack