CADC: Encoding User-Item Interactions for Compressing Recommendation Model Training Data

Deep learning recommendation models (DLRMs) are at the heart of the current e-commerce industry. However, the amount of data used to train these large models is growing exponentially, which creates substantial training hurdles. The training dataset contains two primary types of information: content-based information (features of users and items) and collaborative information (interactions between users and items). One way to shrink the training dataset is to remove user-item interactions, but doing so significantly diminishes the collaborative information, which carries the interaction histories and is crucial for maintaining accuracy. This loss profoundly impacts DLRM performance. This paper makes the key observation that if the user-item interaction history is captured to enrich the user and item embeddings, then the interaction history can be compressed without losing model accuracy. Building on this observation, Collaborative Aware Data Compression (CADC) takes a two-step approach to training dataset compression. In the first step, we apply matrix factorization to the user-item interaction matrix to create a novel embedding representation for both users and items. In the second step, once the user and item embeddings are enriched with the interaction history, we apply uniform random sampling to the training dataset to drastically reduce its size while minimizing the drop in model accuracy. The source code of CADC is available at https://anonymous.4open.science/r/DSS-RM-8C1D/README.md.
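The two steps described above can be illustrated with a minimal sketch. It is not the CADC implementation: it assumes interactions arrive as `(user, item, label)` tuples, it uses a truncated SVD from NumPy as a stand-in for whatever matrix factorization the authors use, and the names `factorize_interactions`, `uniform_sample`, `rank`, and `keep_fraction` are hypothetical, chosen only for illustration.

```python
import numpy as np


def factorize_interactions(interactions, n_users, n_items, rank):
    """Step 1 (sketch): derive user and item embeddings from the
    user-item interaction matrix. Truncated SVD is an assumption here,
    used as one simple form of matrix factorization."""
    matrix = np.zeros((n_users, n_items), dtype=np.float64)
    for user, item, label in interactions:
        matrix[user, item] = label
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(rank, s.shape[0])
    sqrt_s = np.sqrt(s[:k])
    # Split the singular values evenly between the two factors so that
    # user_emb @ item_emb.T approximates the interaction matrix.
    user_emb = u[:, :k] * sqrt_s
    item_emb = vt[:k, :].T * sqrt_s
    return user_emb, item_emb


def uniform_sample(interactions, keep_fraction, seed=0):
    """Step 2 (sketch): keep a uniformly random subset of the
    interactions, sampled without replacement."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(round(len(interactions) * keep_fraction)))
    idx = rng.choice(len(interactions), size=n_keep, replace=False)
    return [interactions[i] for i in sorted(idx)]


# Toy data: 4 users, 3 items, 6 observed interactions (all hypothetical).
interactions = [
    (0, 0, 1.0), (0, 1, 1.0), (1, 1, 1.0),
    (2, 2, 1.0), (2, 0, 1.0), (3, 2, 1.0),
]
user_emb, item_emb = factorize_interactions(interactions, 4, 3, rank=2)
compressed = uniform_sample(interactions, keep_fraction=0.5)
```

In this sketch the embeddings are computed from the full interaction set before any sampling, so the information in the discarded interactions is still reflected in `user_emb` and `item_emb`. How those embeddings are then fed into the DLRM is not specified in the abstract and is left out here.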
