Efficient Content-based Recommendation Model Training via Noise-aware Coreset Selection

Content-based recommendation systems (CRSs) utilize content features to predict user-item interactions, serving as essential tools for helping users navigate information-rich web services. However, keeping CRSs effective requires large-scale and even continuous model training to accommodate diverse user preferences, which incurs significant computational costs and resource demands. A promising approach to this challenge is coreset selection, which identifies a small but representative subset of data samples that preserves model quality while reducing training overhead. Yet the selected coreset is vulnerable to the pervasive noise in user-item interactions, particularly when the coreset is very small. To this end, we propose Noise-aware Coreset Selection (NaCS), a specialized framework for CRSs. NaCS constructs coresets through submodular optimization based on training gradients, while simultaneously correcting noisy labels using a progressively trained model. In addition, NaCS refines the selected coreset by filtering out low-confidence samples through uncertainty quantification, thereby avoiding training on unreliable interactions. Through extensive experiments, we show that NaCS produces higher-quality coresets for CRSs while achieving better efficiency than existing coreset selection techniques. Notably, NaCS recovers 93-95\% of full-dataset training performance using merely 1\% of the training data. The source code is available at \href{https://github.com/chenxing1999/nacs}{https://github.com/chenxing1999/nacs}.
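
To make the two ingredients named above more concrete, the following is a minimal illustrative sketch, not the NaCS implementation. It assumes a facility-location objective as the submodular function over per-sample gradient vectors, and it assumes a simple max-probability confidence score as the uncertainty measure; the abstract specifies neither. The function names `greedy_facility_location` and `correct_and_filter`, as well as the thresholds `flip_threshold` and `min_confidence`, are hypothetical.

```python
import numpy as np


def greedy_facility_location(grads, k):
    """Greedily pick k rows of `grads` under a facility-location objective.

    Assumption: similarity is a shifted negative Euclidean distance between
    per-sample gradient vectors. This is an illustrative choice only.
    """
    dists = np.linalg.norm(grads[:, None, :] - grads[None, :, :], axis=-1)
    sim = dists.max() - dists  # non-negative similarities, shape (n, n)
    n = sim.shape[0]
    best_cover = np.zeros(n)   # best similarity of each sample to the chosen set
    chosen = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(k):
        # Marginal gain of candidate j: sum_i max(sim[i, j] - best_cover[i], 0)
        gains = np.maximum(sim - best_cover[:, None], 0.0).sum(axis=0)
        gains[chosen] = -np.inf  # never re-select a sample
        j = int(np.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected


def correct_and_filter(probs, labels, flip_threshold=0.9, min_confidence=0.6):
    """Relabel confidently predicted samples and flag low-confidence ones.

    Assumption: `probs` holds a model's predicted probability of a positive
    interaction, and confidence is max(p, 1 - p). Both thresholds are
    hypothetical values chosen for illustration.
    """
    confidence = np.maximum(probs, 1.0 - probs)
    predicted = (probs >= 0.5).astype(labels.dtype)
    # Trust the model over the observed label only when it is very confident.
    corrected = np.where(confidence >= flip_threshold, predicted, labels)
    keep = confidence >= min_confidence  # mask of samples retained for training
    return corrected, keep
```

In this sketch, selection and label handling are separate steps for readability. The abstract describes them as happening simultaneously with a progressively trained model, which the sketch does not attempt to reproduce.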
