While personalized recommendation systems have become increasingly popular, protecting user data remains a paramount concern in the development of these learning systems. A common approach to enhancing privacy is to train models on anonymous data rather than individual data. In this paper, we explore a natural technique called \emph{look-alike clustering}, which replaces the sensitive features of each individual with the average values of that individual's cluster. We provide a precise analysis of how training models on anonymous cluster centers affects their generalization capabilities. We focus on an asymptotic regime where the size of the training set grows in proportion to the feature dimension. Our analysis is based on the Convex Gaussian Minimax Theorem (CGMT) and allows us to theoretically understand the role of different model components in the generalization error. In addition, we demonstrate that in certain high-dimensional regimes, training over anonymous cluster centers acts as a form of regularization and reduces the generalization error of the trained models. Finally, we corroborate our asymptotic theory with finite-sample numerical experiments, where we observe a perfect match even when the sample size is only on the order of a few hundred.
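The replacement step behind look-alike clustering, as described in the abstract, can be illustrated with a minimal sketch. It assumes that cluster assignments are already available and that the sensitive features are identified by column index; the function name `look_alike_anonymize` and the toy data are hypothetical and chosen only for illustration, not taken from the paper.

```python
import numpy as np


def look_alike_anonymize(X, labels, sensitive_cols):
    """Replace the sensitive columns of every row with its cluster's mean.

    Assumptions (not from the paper): `labels` holds one precomputed
    cluster id per row, and `sensitive_cols` lists the column indices
    of the sensitive features. Non-sensitive columns are left unchanged.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    X_anon = X.copy()
    for c in np.unique(labels):
        members = labels == c  # boolean mask of rows in cluster c
        # Mean of the sensitive features over the members of this cluster.
        cluster_mean = X[members][:, sensitive_cols].mean(axis=0)
        # Broadcast the cluster mean over all member rows.
        X_anon[np.ix_(members, sensitive_cols)] = cluster_mean
    return X_anon


# Toy data: columns 0 and 1 are treated as sensitive, column 2 is not.
X = np.array([
    [1.0, 10.0, 0.0],
    [3.0, 20.0, 1.0],
    [5.0, 30.0, 2.0],
    [7.0, 40.0, 3.0],
])
labels = [0, 0, 1, 1]
X_anon = look_alike_anonymize(X, labels, sensitive_cols=[0, 1])
print(X_anon)
```

A model would then be trained on `X_anon` instead of `X`, so that individuals in the same cluster share identical sensitive features while keeping their own non-sensitive ones.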