Malicious users attempt to functionally replicate commercial models at low cost by training a clone model on query responses. It is challenging to prevent such model-stealing attacks in a timely manner while achieving strong protection and maintaining utility. In this paper, we propose a novel non-parametric detector called Account-aware Distribution Discrepancy (ADD) that recognizes queries from malicious users by leveraging account-wise local dependency. We model each class as a multivariate normal distribution (MVN) in the feature space and measure the malicious score as the weighted sum of class-wise distribution discrepancies. The ADD detector is combined with random-based prediction poisoning to yield a plug-and-play defense module named D-ADD for image classification models. Extensive experiments show that D-ADD achieves strong defense against different types of attacks with little interference in serving benign users under both soft-label and hard-label settings.
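The per-class MVN scoring idea can be sketched in a few lines. This is an illustrative sketch only, not the paper's exact formulation: the function names (`fit_class_mvns`, `malicious_score`), the use of mean squared Mahalanobis distance as the class-wise discrepancy, and the choice of per-class query fractions as the weights are all assumptions made for illustration.

```python
import numpy as np

def fit_class_mvns(features, labels):
    """Fit a reference MVN (mean, inverse covariance) per class
    from benign feature vectors. Regularization added for stability."""
    mvns = {}
    for c in np.unique(labels):
        X = features[labels == c]
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        mvns[c] = (mu, np.linalg.inv(cov))
    return mvns

def malicious_score(account_features, account_preds, mvns):
    """Weighted sum of class-wise discrepancies for one account's queries.
    Assumption: discrepancy = mean squared Mahalanobis distance to the
    class MVN; weight = fraction of the account's queries in that class."""
    score, n = 0.0, len(account_preds)
    for c, (mu, cov_inv) in mvns.items():
        Xc = account_features[account_preds == c]
        if len(Xc) == 0:
            continue
        d = Xc - mu
        disc = np.einsum('ij,jk,ik->i', d, cov_inv, d).mean()
        score += (len(Xc) / n) * disc
    return score
```

Under this sketch, an account whose queries deviate strongly from the reference class-conditional feature distributions receives a higher score, which is the signal the detector would threshold on.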