Hard-label black-box settings, where only top-1 predicted labels are observable, constitute a severely constrained yet practically important feedback model for understanding model behavior. A central challenge in this regime is whether meaningful gradient information can be recovered from such discrete responses. In this work, we develop a unified theoretical perspective showing that a wide range of existing sign-flipping hard-label attacks can be interpreted as implicitly approximating the sign of the true loss gradient. This observation reframes hard-label attacks from heuristic search procedures into instances of gradient sign recovery under extremely limited feedback. Motivated by this first-principles understanding, we propose a new attack framework that combines a zero-query frequency-domain initialization with a Pattern-Driven Optimization (PDO) strategy. We establish theoretical guarantees demonstrating that, under mild assumptions, our initialization achieves higher expected cosine similarity to the true gradient sign than random baselines, while the proposed PDO procedure attains substantially lower query complexity than existing structured search approaches. We empirically validate our framework through extensive experiments on CIFAR-10, ImageNet, and ObjectNet, covering standard and adversarially trained models, commercial APIs, and CLIP-based models. The results show that our method consistently surpasses state-of-the-art hard-label attacks in both attack success rate and query efficiency, particularly in low-query regimes. Beyond image classification, our approach generalizes effectively to corrupted data, biomedical datasets, and dense prediction tasks. Notably, it also circumvents Blacklight, a state-of-the-art stateful defense, with a $0\%$ detection rate. Our code will be released publicly soon at https://github.com/csjunjun/DPAttack.git.
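To make the gradient-sign interpretation concrete, the following toy sketch may help. It is not the PDO procedure or the frequency-domain initialization described above; it assumes a binary linear classifier, for which the loss gradient with respect to the input is proportional to $-w$, and all function names (`make_oracle`, `sign_cosine`, `boundary_radius`, `sign_flip_search`, `run_demo`) are hypothetical. A greedy search flips one sign at a time and keeps the flip only when hard-label queries show that the decision boundary moved closer; in this linear setting each accepted flip brings the sign vector into closer agreement with the true gradient sign.

```python
import numpy as np

def make_oracle(w, b):
    """Hard-label oracle: exposes only the predicted class (0 or 1)."""
    def oracle(x):
        return int(w @ x + b > 0)
    return oracle

def sign_cosine(s, g):
    """Cosine similarity between a +/-1 vector s and sign(g)."""
    return float(s @ np.sign(g)) / len(s)

def boundary_radius(oracle, x, s, label, hi=1e3, steps=40):
    """Binary-search the smallest eps with oracle(x + eps * s) != label.

    Returns inf if even eps = hi does not change the label.
    """
    if oracle(x + hi * s) == label:
        return np.inf
    lo = 0.0
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if oracle(x + mid * s) != label:
            hi = mid
        else:
            lo = mid
    return hi

def sign_flip_search(oracle, x, label, s, rng, n_iters=400):
    """Greedy coordinate sign flipping driven only by hard-label queries."""
    s = s.copy()
    best = boundary_radius(oracle, x, s, label)
    for _ in range(n_iters):
        i = rng.integers(len(s))
        s[i] = -s[i]                      # tentative flip
        r = boundary_radius(oracle, x, s, label)
        if r < best:
            best = r                      # boundary got closer: keep the flip
        else:
            s[i] = -s[i]                  # otherwise revert
    return s, best

def run_demo(seed=0):
    """Toy setup (an assumption for illustration): 16-dim linear classifier."""
    d = 16
    signs = np.where(np.arange(d) % 2 == 0, 1.0, -1.0)
    w = np.linspace(0.5, 2.0, d) * signs
    b = 1.0
    x = np.zeros(d)
    oracle = make_oracle(w, b)
    label = oracle(x)                     # class 1, since w @ x + b = 1 > 0
    g = -w                                # direction that increases the loss
    s0 = np.sign(g)
    s0[:6] *= -1                          # start with 6 of 16 signs wrong
    r0 = boundary_radius(oracle, x, s0, label)
    rng = np.random.default_rng(seed)
    s, r = sign_flip_search(oracle, x, label, s0, rng)
    return sign_cosine(s0, g), sign_cosine(s, g), r0, r

if __name__ == "__main__":
    cos_init, cos_final, r_init, r_final = run_demo()
    print(f"cosine to gradient sign: {cos_init:.2f} -> {cos_final:.2f}")
    print(f"L-inf boundary radius:   {r_init:.4f} -> {r_final:.4f}")
```

The search never sees `w` or `g`; they are used only to build the oracle and to score the result. Each candidate flip costs a full binary search here, which is the kind of query overhead that structured search strategies aim to reduce.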