Machine Learning as a Service (MLaaS) platforms have gained popularity due to their accessibility, cost-efficiency, scalability, and rapid development capabilities. However, recent research has highlighted the vulnerability of cloud-based models in MLaaS to model extraction attacks. In this paper, we introduce FDINET, a novel defense mechanism that leverages the feature distribution of deep neural network (DNN) models. Concretely, by analyzing the feature distribution of the adversary's queries, we reveal that it deviates from that of the model's training set. Based on this key observation, we propose the Feature Distortion Index (FDI), a metric designed to quantitatively measure the feature distribution deviation of received queries. FDINET uses FDI to train a binary detector and exploits FDI similarity to identify colluding adversaries in distributed extraction attacks. We conduct extensive experiments to evaluate FDINET against six state-of-the-art extraction attacks on four benchmark datasets and four popular model architectures. Empirical results demonstrate the following findings: (1) FDINET is highly effective in detecting model extraction, achieving 100% detection accuracy on DFME and DaST. (2) FDINET is highly efficient, using just 50 queries to raise an extraction alarm with an average confidence of 96.08% for GTSRB. (3) FDINET can identify colluding adversaries with an accuracy exceeding 91%. Additionally, it can detect two types of adaptive attacks.
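The core observation, that features of extraction queries drift away from the training-set feature distribution, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's definition of FDI: deviation is taken as the Euclidean distance from a query's feature vector to the centroid of its predicted class, a simple threshold on the batch mean stands in for the trained binary detector, and the names `class_centroids`, `nearest_centroid`, `deviation_scores`, `calibrate_threshold`, and `flag_batch` are hypothetical.

```python
import numpy as np


def class_centroids(features, labels, num_classes):
    """Mean feature vector per class, computed from training-set features."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])


def nearest_centroid(features, centroids):
    """Stand-in for the model's predicted label: index of the closest centroid."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def deviation_scores(features, predicted_labels, centroids):
    """Assumed deviation measure: distance of each query to its class centroid."""
    return np.linalg.norm(features - centroids[predicted_labels], axis=1)


def calibrate_threshold(train_scores, batch_size, num_std=3.0):
    """Threshold on the batch-mean score, set from training-set scores."""
    return train_scores.mean() + num_std * train_scores.std() / np.sqrt(batch_size)


def flag_batch(scores, threshold):
    """Raise an alarm when the mean deviation of a query batch is too high."""
    return bool(scores.mean() > threshold)


# Synthetic demo: two well-separated classes in an 8-dimensional feature space.
rng = np.random.default_rng(0)
train_x = np.concatenate([rng.normal(0, 1, (200, 8)), rng.normal(5, 1, (200, 8))])
train_y = np.repeat([0, 1], 200)
centroids = class_centroids(train_x, train_y, num_classes=2)
train_scores = deviation_scores(train_x, train_y, centroids)
threshold = calibrate_threshold(train_scores, batch_size=50)

# Queries drawn off the training distribution, as extraction queries often are.
odd_x = rng.normal(2.5, 3, (50, 8))
odd_scores = deviation_scores(odd_x, nearest_centroid(odd_x, centroids), centroids)
print("alarm raised:", flag_batch(odd_scores, threshold))
```

In a real deployment the feature vectors would come from intermediate layers of the protected DNN and the labels from its own predictions; the batch size of 50 mirrors the query budget quoted above but is otherwise arbitrary here.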