We introduce a new framework of adversarial attacks, named calibration attacks, in which the attacks are generated and organized to trap victim models into miscalibration without altering their original accuracy, thereby seriously endangering the trustworthiness of the models and any decision-making based on their confidence scores. Specifically, we identify four novel forms of calibration attacks: underconfidence attacks, overconfidence attacks, maximum miscalibration attacks, and random confidence attacks, in both black-box and white-box setups. We then test these new attacks on typical victim models with comprehensive datasets, demonstrating that even with a relatively low number of queries, the attacks can cause significant calibration errors. We further provide detailed analyses to understand different aspects of calibration attacks. Building on that, we investigate the effectiveness of widely used adversarial defences and calibration methods against these types of attacks, which in turn inspires us to devise two novel defences against such calibration attacks.
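To make concrete what "miscalibrated without altering accuracy" can mean, the following is a minimal sketch under stated assumptions, not the paper's method. It assumes calibration is measured with the commonly used Expected Calibration Error (ECE) with equal-width bins; the abstract does not name a metric. The function name, the toy data, and the shifted confidence values are all hypothetical. The "attacks" here are simulated by editing confidence scores directly rather than by perturbing inputs.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-weighted average of |accuracy - mean confidence|.

    Assumption: equal-width confidence bins over [0, 1]; this metric is an
    illustrative choice, not one stated in the abstract.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical toy data: 10 predictions, 8 of them correct.
# The predicted labels (and hence accuracy) are identical in all three cases.
correct = [True] * 8 + [False] * 2

clean_conf = [0.80] * 10   # confidence matches the 80% accuracy
under_conf = [0.55] * 10   # simulated underconfidence: scores pushed down
over_conf = [0.99] * 10    # simulated overconfidence: scores pushed up

ece_clean = expected_calibration_error(clean_conf, correct)
ece_under = expected_calibration_error(under_conf, correct)
ece_over = expected_calibration_error(over_conf, correct)

print("clean ECE:", round(ece_clean, 3))
print("underconfident ECE:", round(ece_under, 3))
print("overconfident ECE:", round(ece_over, 3))
```

In this toy setting the accuracy stays at 80% throughout, while the ECE moves from roughly 0 to roughly 0.25 (underconfident) and 0.19 (overconfident). That gap between unchanged accuracy and degraded calibration is what the attacks described above aim to produce.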