Privacy-preserving data release algorithms have gained increasing attention for their ability to protect user privacy while enabling downstream machine learning tasks. However, the utility of currently popular algorithms is not always satisfactory. Mixup of raw data offers a form of data augmentation that can improve utility, but its performance deteriorates drastically once differential privacy (DP) noise is added. To address this issue, this paper draws inspiration from the recently observed Neural Collapse (NC) phenomenon, in which the last-layer features of a neural network concentrate on the vertices of a simplex Equiangular Tight Frame (ETF). We propose a scheme that mixes up the Neural Collapse features to exploit the ETF simplex structure and releases the noisy mixed features, enhancing the utility of the released data. Using Gaussian Differential Privacy (GDP), we obtain an asymptotic rate for the optimal mixup degree. To further enhance utility and address the label collapse issue that arises when the mixup degree is large, we propose a hierarchical sampling method that stratifies the mixup samples over a small number of classes. This method remarkably improves utility when the number of classes is large. Extensive experiments demonstrate the effectiveness of the proposed method in protecting against attacks and improving utility. In particular, our approach shows significantly improved utility compared with directly training classification networks with DPSGD on the CIFAR100 and MiniImagenet datasets, highlighting the benefits of privacy-preserving data release. We release reproducible code at https://github.com/Lidonghao1996/NeuroMixGDP.
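To make the two ingredients named above concrete, the following is a minimal NumPy sketch, not the released implementation. It builds the vertices of a simplex ETF and then averages a random group of features before adding Gaussian noise. The function names `simplex_etf` and `noisy_mixup`, the norm-clipping step, the choice to perturb the mixed labels as well, and the parameters `clip_norm` and `sigma` are all assumptions made for illustration; in particular, `sigma` is a free parameter here and is not calibrated to any GDP privacy budget.

```python
import numpy as np


def simplex_etf(num_classes: int) -> np.ndarray:
    """Return a (K, K) matrix whose rows are unit-norm simplex ETF vertices."""
    k = num_classes
    # Centre the identity matrix and rescale so that every row has norm 1
    # and every pair of distinct rows has inner product -1 / (K - 1).
    return np.sqrt(k / (k - 1)) * (np.eye(k) - np.ones((k, k)) / k)


def noisy_mixup(features, labels_onehot, m, clip_norm, sigma, rng):
    """Average m randomly chosen samples and add Gaussian noise.

    Illustrative only: sigma is NOT derived from a privacy budget, and the
    clipping rule is an assumed way to bound each sample's contribution.
    """
    idx = rng.choice(len(features), size=m, replace=False)
    chosen = features[idx]
    # Clip each feature vector to norm at most clip_norm.
    norms = np.linalg.norm(chosen, axis=1, keepdims=True)
    clipped = chosen * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Mix by averaging, then perturb both features and labels (assumption).
    mixed_x = clipped.mean(axis=0) + rng.normal(0.0, sigma, size=features.shape[1])
    mixed_y = labels_onehot[idx].mean(axis=0) + rng.normal(
        0.0, sigma, size=labels_onehot.shape[1]
    )
    return mixed_x, mixed_y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    etf = simplex_etf(10)
    # Toy "collapsed" features: each sample sits exactly on its class vertex.
    classes = rng.integers(0, 10, size=200)
    x, y = noisy_mixup(etf[classes], np.eye(10)[classes], m=16,
                       clip_norm=1.0, sigma=0.05, rng=rng)
    print(x.shape, y.shape)
```

In this toy setting, averaging more samples (larger `m`) shrinks each individual's contribution to the released vector, but it also flattens the mixed label towards a uniform distribution, which is one way to picture the label collapse issue mentioned in the abstract.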