Dataset Condensation for Recommendation

Training recommendation models on large datasets often requires significant time and computational resources. Consequently, there is a growing need to construct smaller yet informative datasets for efficient training. Dataset compression techniques explored in other domains show potential to address this problem, either by sampling a subset or by synthesizing a small dataset. However, applying existing approaches to condense recommendation datasets is impractical due to the following challenges: (i) sampling-based methods are inadequate for handling the long-tailed distribution problem; (ii) synthesis-based methods are not applicable because interactions are discrete and recommendation datasets are large; (iii) neither family of methods addresses the recommendation-specific issue of false negative items, where items of potential interest to a user are incorrectly sampled as negatives owing to insufficient exposure. To bridge this gap, we investigate dataset condensation for recommendation, where discrete interactions are made continuous through probabilistic re-parameterization. To avoid prohibitively expensive computation, we adopt a one-step update strategy for inner model training and introduce policy gradient estimation for outer dataset synthesis. To mitigate amplification of the long-tail problem, we compensate long-tailed users in the condensed dataset. Furthermore, we propose using a proxy model to identify false negative items. A theoretical analysis of the convergence property is provided. Extensive experiments on multiple datasets demonstrate the efficacy of our method. In particular, we reduce the dataset size by 75% while retaining over 98% of the original performance on Dianping and over 90% on other datasets.
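As a rough illustration of the two ideas named above (probabilistic re-parameterization of discrete interactions, and policy gradient estimation), the following is a minimal sketch and not the authors' implementation. It assumes each user-item interaction is a Bernoulli variable whose probability is the sigmoid of a learnable logit, and it uses the standard score-function (REINFORCE) estimator. The names `sample_interactions`, `policy_gradient`, and `loss_fn` are hypothetical, and the toy loss merely stands in for whatever outer objective the method optimizes.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_interactions(logits, rng):
    """Draw a binary user-item matrix: entry (u, i) is 1 with probability sigmoid(logit)."""
    return [[1 if rng.random() < sigmoid(t) else 0 for t in row] for row in logits]


def policy_gradient(logits, loss_fn, rng, num_samples=4000):
    """Score-function estimate of d E[loss(x)] / d logits.

    For a Bernoulli variable x with logit t, d log p(x) / dt = x - sigmoid(t),
    so the gradient of the expected loss is E[loss(x) * (x - sigmoid(t))].
    The loss itself never has to be differentiated through the discrete sample.
    """
    probs = [[sigmoid(t) for t in row] for row in logits]
    grads = [[0.0 for _ in row] for row in logits]
    for _ in range(num_samples):
        x = sample_interactions(logits, rng)
        loss = loss_fn(x)
        for u, row in enumerate(x):
            for i, x_ui in enumerate(row):
                grads[u][i] += loss * (x_ui - probs[u][i]) / num_samples
    return grads


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in objective (an assumption): squared distance to a fixed target matrix.
    target = [[1, 0], [0, 1]]

    def toy_loss(x):
        return sum((a - b) ** 2 for rx, rt in zip(x, target) for a, b in zip(rx, rt))

    logits = [[0.0, 0.0], [0.0, 0.0]]
    grads = policy_gradient(logits, toy_loss, rng)
    # A gradient-descent step on the logits raises the probability of target
    # interactions and lowers the probability of non-target ones.
    logits = [[t - 1.0 * g for t, g in zip(rt, rg)] for rt, rg in zip(logits, grads)]
    print(logits)
```

In this sketch the learnable quantity is the matrix of logits rather than the binary interactions themselves, which is what makes gradient-based synthesis of a discrete dataset possible. The plain estimator shown here has high variance; practical systems usually add a baseline or other variance reduction.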

