Machine Learning (ML) is crucial in many sectors, including computer vision. However, ML models trained on sensitive data face security and privacy challenges, as attacks on them can leak information about the training data. Privacy-Preserving Machine Learning (PPML) addresses this, for example by applying Differential Privacy (DP) to balance utility and privacy. This study identifies image dataset characteristics that affect the utility and vulnerability of private and non-private Convolutional Neural Network (CNN) models. By analyzing multiple datasets and privacy budgets, we find that imbalanced datasets increase the vulnerability of minority classes, although DP mitigates this issue. Datasets with fewer classes improve both model utility and privacy, while datasets with high entropy or a low Fisher Discriminant Ratio (FDR) worsen the utility-privacy trade-off. These insights offer practitioners and researchers guidance for estimating and optimizing the utility-privacy trade-off on image datasets, informing data and privacy modifications that lead to better outcomes based on dataset characteristics.
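To make the dataset characteristics named above concrete, the following is a minimal sketch of how two of them could be measured. It assumes FDR is computed as the ratio of between-class to within-class scatter over flattened images and entropy as the mean Shannon entropy of per-image pixel-intensity histograms; the paper may define these metrics differently, and the function names and data shapes here are illustrative only.

```python
# Hedged sketch (not the paper's code): two dataset characteristics
# discussed in the abstract, computed on a NumPy image array.
import numpy as np

def fisher_discriminant_ratio(images: np.ndarray, labels: np.ndarray) -> float:
    """Ratio of between-class to within-class scatter on flattened images
    (higher values suggest more linearly separable classes)."""
    X = images.reshape(len(images), -1).astype(np.float64)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        class_mean = Xc.mean(axis=0)
        between += len(Xc) * np.sum((class_mean - overall_mean) ** 2)
        within += np.sum((Xc - class_mean) ** 2)
    return between / within

def mean_pixel_entropy(images: np.ndarray, bins: int = 256) -> float:
    """Average Shannon entropy (in bits) of each image's pixel-intensity histogram."""
    entropies = []
    for img in images:
        hist, _ = np.histogram(img, bins=bins, range=(0, 255))
        p = hist / hist.sum()
        p = p[p > 0]
        entropies.append(-np.sum(p * np.log2(p)))
    return float(np.mean(entropies))

if __name__ == "__main__":
    # Random stand-in data shaped like a small grayscale image dataset.
    rng = np.random.default_rng(0)
    imgs = rng.integers(0, 256, size=(100, 28, 28))
    labs = rng.integers(0, 10, size=100)
    print("FDR:", fisher_discriminant_ratio(imgs, labs))
    print("Mean pixel entropy (bits):", mean_pixel_entropy(imgs))
```

Under the abstract's findings, a dataset scoring low on the first metric or high on the second would be expected to suffer a worse utility-privacy trade-off under DP training.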