While recommender systems have become an integral component of the Web experience, their heavy reliance on user data raises privacy and security concerns. Substituting user data with synthetic data can address these concerns, but accurately replicating real-world datasets has been a notoriously challenging problem. Recent advancements in generative AI have demonstrated the impressive capabilities of diffusion models in generating realistic data across various domains. In this work, we introduce a Score-based Diffusion Recommendation Model (SDRM), which captures the intricate patterns of real-world datasets required for training highly accurate recommender systems. SDRM allows for the generation of synthetic data that can replace existing datasets to preserve user privacy, or augment existing datasets to address excessive data sparsity. When synthesizing various datasets to replace or augment the original data, our method outperforms competing baselines, including generative adversarial networks, variational autoencoders, and recently proposed diffusion models, with average improvements of 4.30% in Recall@$n$ and 4.65% in NDCG@$n$.
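For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Recall@$n$ and NDCG@$n$ are commonly computed for a single user with binary relevance. The function names and the toy data are illustrative assumptions, and the exact variant used in the evaluation (for example, whether recall is normalized by the number of held-out items or by $\min(n, \cdot)$) is not stated in the abstract.

```python
import math


def recall_at_n(ranked_items, relevant_items, n):
    """Fraction of held-out relevant items that appear in the top-n list.

    Assumption: recall is normalized by the total number of relevant items;
    some papers normalize by min(n, number of relevant items) instead.
    """
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_items[:n] if item in relevant)
    return hits / len(relevant)


def ndcg_at_n(ranked_items, relevant_items, n):
    """Normalized discounted cumulative gain at n with binary relevance."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    # DCG: each hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:n])
        if item in relevant
    )
    # Ideal DCG: all relevant items (up to n) ranked at the very top.
    ideal_hits = min(len(relevant), n)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


# Hypothetical toy example: one user's ranked recommendations and held-out items.
ranked = ["a", "b", "c", "d"]
held_out = {"a", "d", "e"}
print(recall_at_n(ranked, held_out, n=3))
print(ndcg_at_n(ranked, held_out, n=3))
```

In practice, both metrics are averaged over all test users, and the reported gains compare a recommender trained on synthetic or augmented data against one trained with a baseline generator.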