Score-based query attacks pose a serious threat to deep learning models: using only black-box access to a model's output scores, they craft adversarial examples (AEs) by iteratively optimizing inputs based on observed loss values. While recent runtime defenses attempt to disrupt this process via output perturbation, most either require access to model parameters or fail when attackers adapt their tactics. In this paper, we first reveal that even the state-of-the-art plug-and-play defense can be bypassed by adaptive attacks, exposing a critical limitation of existing runtime defenses. We then propose Dashed Line Defense (DLD), a plug-and-play post-processing method specifically designed to withstand adaptive query strategies. By making it ambiguous how the observed loss reflects the true adversarial strength of a candidate example, DLD prevents attackers from reliably analyzing and adapting their queries, effectively disrupting the AE generation process. We provide theoretical guarantees of DLD's defense capability and validate its effectiveness through experiments on ImageNet, demonstrating that DLD consistently outperforms prior defenses, even under worst-case adaptive attacks, while preserving the model's predicted labels.
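The defense setting described above, a plug-and-play post-processor that perturbs output scores while leaving the model's predicted label intact, can be sketched minimally as follows. The Gaussian noise model, the `noise_scale` parameter, and the argmax-restoring swap are illustrative assumptions for a generic output-perturbation defense, not DLD's actual mechanism:

```python
import numpy as np

def perturb_scores(scores, noise_scale=0.05, rng=None):
    """Post-process a model's output score vector with random noise while
    keeping the predicted label (argmax) unchanged, so benign predictions
    are unaffected. Illustrative sketch only: DLD's actual perturbation is
    designed so the observed loss ambiguously reflects adversarial strength.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = scores + rng.normal(0.0, noise_scale, size=scores.shape)
    # If the noise flipped the top prediction, swap the two entries
    # involved so the original argmax is restored.
    orig, new = int(np.argmax(scores)), int(np.argmax(noisy))
    if new != orig:
        noisy[orig], noisy[new] = noisy[new], noisy[orig]
    return noisy
```

An attacker querying such a defended model observes noisy scores, so loss differences between consecutive queries no longer reliably indicate progress toward an adversarial example.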