Deep learning models can be vulnerable to recovery attacks, raising privacy concerns for users, and widely used algorithms such as empirical risk minimization (ERM) often do not directly enforce safety guarantees. In this paper, we study the safety of ERM-trained models against a family of powerful black-box attacks. Our analysis quantifies this safety via two separate terms: (i) the model stability with respect to individual training samples, and (ii) the feature alignment between the attacker query and the original data. While the first term is well established in learning theory and is connected to the generalization error in classical work, the second is, to the best of our knowledge, novel. Our key technical result provides a precise characterization of the feature alignment for the two prototypical settings of random features (RF) and neural tangent kernel (NTK) regression. This proves that privacy strengthens as the generalization capability increases, and also unveils the role of the activation function. Numerical experiments show behavior in agreement with our theory not only for the RF and NTK models, but also for deep neural networks trained on standard datasets (MNIST, CIFAR-10).