Machine learning (ML) has achieved enormous success on a variety of problems in computer vision, speech recognition, and object detection, to name a few. The principal reason for this success is the availability of huge datasets for training deep neural networks (DNNs). However, datasets cannot be publicly released if they contain sensitive information such as medical or financial records. In such cases, data privacy becomes a major concern. Encryption methods offer a possible solution, but their deployment in ML applications is non-trivial: they seriously degrade classification accuracy and incur substantial computational overhead. Alternatively, obfuscation techniques can be used, but maintaining a good balance between visual privacy and accuracy is challenging. In this work, we propose a method to generate secure synthetic datasets from the original private datasets. Given a network with Batch Normalization (BN) layers pre-trained on the original dataset, we first record the layer-wise BN statistics. Next, using the BN statistics and the pre-trained model, we generate the synthetic dataset by optimizing random noise so that the synthetic data match the layer-wise statistical distribution of the original model. We evaluate our method on an image classification dataset (CIFAR10) and show that our synthetic data can be used to train networks from scratch, producing reasonable classification performance.
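The statistic-matching step can be illustrated with a toy sketch. This is not the implementation evaluated in this work: it assumes a single fixed linear layer in place of a pre-trained DNN, a squared-difference loss between batch statistics and recorded BN statistics (the abstract does not specify the loss), and plain gradient descent. All names (`W`, `X_private`, `bn_stat_loss_and_grad`, `synthesize`) and the values of `lr` and `steps` are hypothetical.

```python
import numpy as np


def bn_stat_loss_and_grad(X, W, target_mean, target_var):
    """Return the statistic-matching loss and its gradient with respect to X."""
    n = X.shape[0]
    Z = X @ W                    # activations that would feed the BN layer
    mean = Z.mean(axis=0)        # per-feature batch mean
    var = Z.var(axis=0)          # per-feature (biased) batch variance
    d_mean = mean - target_mean
    d_var = var - target_var
    # Assumed loss: squared distance between batch and recorded statistics.
    loss = float(np.sum(d_mean ** 2) + np.sum(d_var ** 2))
    # Analytic gradient w.r.t. Z, then back through the linear layer.
    grad_Z = (2.0 / n) * d_mean + (4.0 / n) * d_var * (Z - mean)
    return loss, grad_Z @ W.T


def synthesize(W, target_mean, target_var, n_samples=64, steps=2000, lr=0.05, seed=0):
    """Optimize random noise until its layer statistics approach the targets."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, W.shape[0]))   # start from random noise
    initial_loss, _ = bn_stat_loss_and_grad(X, W, target_mean, target_var)
    for _ in range(steps):
        _, grad = bn_stat_loss_and_grad(X, W, target_mean, target_var)
        X -= lr * grad                             # plain gradient descent
    final_loss, _ = bn_stat_loss_and_grad(X, W, target_mean, target_var)
    return X, initial_loss, final_loss


# Hypothetical stand-ins for the pre-trained layer and the private data.
rng = np.random.default_rng(42)
W = rng.normal(size=(8, 4)) / np.sqrt(8)
X_private = rng.normal(loc=1.0, scale=2.0, size=(256, 8))

# "Record" the BN statistics; only these and W are used for synthesis.
Z_private = X_private @ W
bn_mean, bn_var = Z_private.mean(axis=0), Z_private.var(axis=0)

X_syn, initial_loss, final_loss = synthesize(W, bn_mean, bn_var)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

Note that `synthesize` never reads `X_private`; it sees only the layer weights and the recorded statistics. A fuller version would presumably sum such a loss over every BN layer of the pre-trained network and rely on automatic differentiation rather than a hand-derived gradient.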