Creating Synthetic Datasets for Collaborative Filtering Recommender Systems using Generative Adversarial Networks

Research and education in machine learning need diverse, representative, and open datasets that contain enough samples to support the necessary training, validation, and testing tasks. Currently, the Recommender Systems area includes a large number of subfields in which accuracy and beyond-accuracy quality measures are continuously improved. To feed this variety of research, it is necessary and convenient to complement the existing datasets with synthetic ones. This paper proposes a Generative Adversarial Network (GAN)-based method to generate collaborative filtering datasets in a parameterized way, by selecting the desired number of users, items, and samples, and the stochastic variability. This parameterization cannot be achieved with regular GANs. Our GAN model is fed with dense, short, and continuous embedding representations of items and users, rather than the sparse, large, and discrete vectors of the traditional approach, which makes learning both accurate and fast. The proposed architecture includes a DeepMF model to extract the dense user and item embeddings, as well as a clustering process to convert the dense GAN-generated samples into the discrete and sparse ones needed to create each required synthetic dataset. Results on three different source datasets show adequate distributions and the expected quality values and trends in the generated datasets compared to the source ones. The synthetic datasets and source code are available to researchers.
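The clustering step mentioned in the abstract, which turns dense generated samples into discrete user and item identifiers, can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm is used, so the plain k-means below is an assumption, as are the function names `kmeans` and `dense_to_discrete`, the embedding size, and the random vectors standing in for GAN output. The idea shown is that setting the number of clusters to the desired number of users and items fixes the size of the synthetic dataset.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct input points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centroid.
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d.argmin(axis=1)

def dense_to_discrete(user_vecs, item_vecs, ratings, n_users, n_items, seed=0):
    """Map dense samples to sparse (user_id, item_id, rating) triples.

    Hypothetical helper: each cluster index is treated as a synthetic id.
    """
    _, user_ids = kmeans(user_vecs, n_users, seed=seed)
    _, item_ids = kmeans(item_vecs, n_items, seed=seed + 1)
    # Several samples may fall on the same (user, item) pair; average them.
    sums = {}
    for u, i, r in zip(user_ids, item_ids, ratings):
        total, count = sums.get((int(u), int(i)), (0.0, 0))
        sums[(int(u), int(i))] = (total + float(r), count + 1)
    return [(u, i, total / count) for (u, i), (total, count) in sorted(sums.items())]

# Stand-in for GAN output: random dense vectors, NOT real generated samples.
rng = np.random.default_rng(42)
n_samples, dim = 500, 5
fake_user_vecs = rng.normal(size=(n_samples, dim))
fake_item_vecs = rng.normal(size=(n_samples, dim))
fake_ratings = rng.uniform(1.0, 5.0, size=n_samples)

triples = dense_to_discrete(fake_user_vecs, fake_item_vecs, fake_ratings,
                            n_users=20, n_items=10)
print(len(triples), "distinct (user, item) ratings")
```

In this sketch the requested numbers of users and items bound the size of the rating matrix, while the number of generated samples controls how densely it is filled. Averaging colliding samples is one possible design choice; the paper's own handling of collisions is not described in the abstract.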

