Research and education in machine learning need diverse, representative, and open datasets that contain enough samples for the necessary training, validation, and testing tasks. The Recommender Systems area currently includes a large number of subfields in which accuracy and beyond-accuracy quality measures are continuously improved. To feed this variety of research, it is both necessary and convenient to complement the existing datasets with synthetic ones. This paper proposes a Generative Adversarial Network (GAN)-based method to generate collaborative filtering datasets in a parameterized way, in which the desired number of users, items, and samples, as well as the stochastic variability, can be selected. This parameterization cannot be achieved with regular GANs. Our GAN model is fed with dense, short, and continuous embedding representations of items and users, which makes learning faster and more accurate than the traditional approach based on large, sparse, and discrete input vectors. The proposed architecture includes a DeepMF model to extract the dense user and item embeddings, as well as a clustering process to convert the dense GAN-generated samples into the discrete and sparse ones needed to create each required synthetic dataset. Results on three different source datasets show that the generated datasets have adequate distributions and the expected quality values and trends compared with the source ones. The synthetic datasets and the source code are available to researchers.
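The final conversion step, from dense generated samples to a discrete and sparse dataset, can be illustrated with a minimal sketch. It rests on several assumptions not stated in the abstract: k-means is used as the clustering algorithm, random vectors stand in for the GAN output, duplicate (user, item) pairs are merged by averaging their ratings, and the names `kmeans_labels` and `discretize` are hypothetical. The number of clusters plays the role of the selected number of users and items.

```python
import numpy as np


def kmeans_labels(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index in [0, k) for each row."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct input rows.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[c] = members.mean(axis=0)
    return labels


def discretize(user_emb, item_emb, ratings, n_users, n_items):
    """Map dense (user_emb, item_emb, rating) samples to (user, item, rating) rows."""
    # Assumption: each cluster of embeddings becomes one synthetic user or item.
    user_ids = kmeans_labels(user_emb, n_users, seed=0)
    item_ids = kmeans_labels(item_emb, n_items, seed=1)
    # Several dense samples can collapse onto the same (user, item) pair;
    # this sketch averages their ratings (an assumption, not the paper's rule).
    pairs = user_ids * n_items + item_ids
    unique_pairs, inverse = np.unique(pairs, return_inverse=True)
    inverse = inverse.ravel()
    mean_ratings = np.bincount(inverse, weights=ratings) / np.bincount(inverse)
    return np.column_stack(
        [unique_pairs // n_items, unique_pairs % n_items, mean_ratings]
    )


# Random stand-in for dense GAN-generated samples (hypothetical sizes).
rng = np.random.default_rng(42)
n_samples, dim = 500, 5
user_emb = rng.normal(size=(n_samples, dim))
item_emb = rng.normal(size=(n_samples, dim))
ratings = rng.uniform(1.0, 5.0, size=n_samples)

dataset = discretize(user_emb, item_emb, ratings, n_users=20, n_items=30)
```

Each row of `dataset` is a (user id, item id, rating) triple, so the result has the sparse, discrete form of a collaborative filtering dataset, with at most `n_users` users and `n_items` items.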