Generative Adversarial Networks (GANs) can generate synthetic samples that match a target distribution. From a privacy perspective, however, using GANs as a proxy for data sharing is not safe, as they tend to embed near-duplicates of real samples in the latent space. Recent works, inspired by k-anonymity principles, address this issue through sample aggregation in the latent space, with the drawback of reducing the dataset size by a factor of k. Our work aims to mitigate this problem by proposing a latent space navigation strategy that generates diverse synthetic samples able to support effective training of deep models, while addressing privacy concerns in a principled way. Our approach uses an auxiliary identity classifier as a guide to walk non-linearly between points in the latent space, minimizing the risk of collision with near-duplicates of real samples. We empirically demonstrate that, given any random pair of points in the latent space, our walking strategy is safer than linear interpolation. We then combine our path-finding strategy with k-same methods and demonstrate, on two benchmarks for tuberculosis and diabetic retinopathy classification, that training a model on samples generated by our approach mitigates drops in performance while preserving privacy.
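To make the contrast between linear interpolation and a guided, non-linear walk concrete, the following is a minimal toy sketch and not the method evaluated in this work. It assumes a 2-D latent space, replaces the auxiliary identity classifier with a hypothetical Gaussian-bump score (`identity_risk`) that is high near the latent codes of real samples, and bends the path with a simple elastic-band gradient descent. All function names, parameters, and values are illustrative assumptions.

```python
import numpy as np

SIGMA = 0.3  # assumed width of the "identity" region around each real latent code


def identity_risk(points, real_latents, sigma=SIGMA):
    """Hypothetical stand-in for identity-classifier confidence.

    Returns, for each point, a sum of Gaussian bumps centred on the
    latent codes of real samples (high value = close to a real sample).
    """
    diff = points[:, None, :] - real_latents[None, :, :]  # (n_points, n_real, dim)
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / (2 * sigma ** 2)).sum(axis=1)


def identity_risk_grad(points, real_latents, sigma=SIGMA):
    """Analytic gradient of identity_risk with respect to each point."""
    diff = points[:, None, :] - real_latents[None, :, :]
    weights = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * sigma ** 2))
    return -(weights[..., None] * diff).sum(axis=1) / sigma ** 2


def linear_path(z_start, z_end, n_points=21):
    """Baseline: evenly spaced points on the straight segment."""
    t = np.linspace(0.0, 1.0, n_points)[:, None]
    return (1 - t) * z_start + t * z_end


def guided_path(z_start, z_end, real_latents, n_points=21,
                risk_weight=1.0, smooth_weight=5.0, lr=0.01, n_iters=2000):
    """Bend the straight path away from real latents (assumed scheme).

    Interior points descend the gradient of
    risk_weight * risk + smooth_weight * sum of squared segment lengths;
    the two endpoints stay fixed.
    """
    path = linear_path(z_start, z_end, n_points)
    for _ in range(n_iters):
        inner = path[1:-1]
        risk_grad = identity_risk_grad(inner, real_latents)
        # Gradient of the squared-segment-length term for interior points.
        smooth_grad = 2 * (2 * inner - path[:-2] - path[2:])
        path[1:-1] = inner - lr * (risk_weight * risk_grad
                                   + smooth_weight * smooth_grad)
    return path


# Toy setup: one real latent code sitting almost on the straight segment.
real_latents = np.array([[0.0, 0.05]])
z_a = np.array([-1.0, 0.0])
z_b = np.array([1.0, 0.0])

straight = linear_path(z_a, z_b)
guided = guided_path(z_a, z_b, real_latents)

print("max risk, linear path:", identity_risk(straight, real_latents).max())
print("max risk, guided path:", identity_risk(guided, real_latents).max())
```

In this toy setting the straight segment passes almost directly through the real latent code, while the guided path bows around it. The smoothness term keeps consecutive points close, so the walk still connects the two endpoints; a real system would query a trained identity classifier instead of the Gaussian score assumed here.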