Training recommendation models on large datasets often requires significant time and computational resources. Consequently, there is a pressing need to construct smaller yet informative datasets for efficient training. Dataset compression techniques explored in other domains show potential for addressing this problem, either by sampling a subset or by synthesizing a small dataset. However, applying existing approaches to condense recommendation datasets is impractical due to the following challenges: (i) sampling-based methods are inadequate for handling the long-tailed distribution problem; (ii) synthesis-based methods are not applicable because interactions are discrete and recommendation datasets are large; (iii) neither of them addresses the recommendation-specific issue of false negative items, where items of potential interest to a user are incorrectly sampled as negatives owing to insufficient exposure. To bridge this gap, we investigate dataset condensation for recommendation, where discrete interactions are made continuous via probabilistic re-parameterization. To avoid prohibitively expensive computation, we adopt a one-step update strategy for inner model training and introduce policy gradient estimation for outer dataset synthesis. To mitigate amplification of the long-tail problem, we compensate long-tailed users in the condensed dataset. Furthermore, we propose to use a proxy model to identify false negative items. We provide a theoretical analysis of the convergence property. Extensive experiments on multiple datasets demonstrate the efficacy of our method. In particular, we reduce the dataset size by 75% while retaining over 98% of the original performance on Dianping and over 90% on other datasets.
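To make the re-parameterization and policy gradient steps more concrete, the sketch below treats each candidate interaction as a Bernoulli variable with a learnable logit and estimates the gradient of an expected loss with the score-function (REINFORCE) estimator. This is an illustration under stated assumptions, not the authors' implementation: the names `sample_interactions`, `policy_gradient`, and `loss_fn` are hypothetical, and the toy linear loss merely stands in for the loss of a recommendation model after a one-step inner update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_interactions(theta, rng, n_samples):
    """Draw binary interaction vectors x ~ Bernoulli(sigmoid(theta))."""
    p = sigmoid(theta)
    return (rng.random((n_samples, theta.size)) < p).astype(float)

def policy_gradient(theta, loss_fn, rng, n_samples=1000):
    """Score-function estimate of d E[loss_fn(x)] / d theta.

    loss_fn is a hypothetical black box mapping a binary interaction
    vector to a scalar loss; it does not need to be differentiable in x.
    """
    p = sigmoid(theta)
    x = sample_interactions(theta, rng, n_samples)
    losses = np.array([loss_fn(row) for row in x])
    # Mean-loss baseline for variance reduction (adds a small O(1/n) bias).
    baseline = losses.mean()
    # For a Bernoulli with logit theta: d log p(x) / d theta = x - sigmoid(theta).
    score = x - p
    return ((losses - baseline)[:, None] * score).mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = np.zeros(3)                  # logits of three candidate interactions
    cost = np.array([1.0, -2.0, 0.5])    # assumed toy per-interaction cost
    grad = policy_gradient(theta, lambda x: float(cost @ x), rng, n_samples=20000)
    theta -= 0.1 * grad                  # one gradient step on the logits
    print(grad, sigmoid(theta))
```

For this toy loss the exact gradient is `cost * p * (1 - p)`, so the estimate can be checked against a closed form. In a real setting `loss_fn` would be far more expensive, which is presumably why a one-step inner update matters.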