Text-based sequential recommender systems greatly improve recommendation accuracy by incorporating item contexts, but they are considerably more expensive to train. Dataset distillation, which condenses a large dataset into a compact set of synthetic samples for model training, offers a promising solution. However, adopting it for text-based sequential recommendation is non-trivial given the large pool of discrete items. The challenge is compounded by language model-based item encoding, which makes the bi-level optimization commonly used in dataset distillation prohibitively expensive. To this end, we propose First-order dataset distillation for Text-based Sequential Recommendation (FOSTER), which achieves both effectiveness and efficiency through three novel components: (1) stochastic item subset sampling, which replaces costly full-corpus embedding extraction at each distillation step; (2) first-order optimization with trajectory-anchored parameter reset, which avoids expensive bi-level gradient computation; and (3) a regularizer that explicitly promotes co-occurrence between semantically similar items in the synthetic sequences. Extensive experiments on three benchmarks show that FOSTER consistently outperforms existing dataset distillation and coreset selection baselines, approximating full-dataset performance with as few as 20 synthetic interaction sequences.
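To make component (1) more concrete, the following is a minimal sketch of stochastic item subset sampling, not the FOSTER implementation. The abstract does not say how the subset is drawn, so uniform sampling, the rule that items appearing in the current sequences are always kept, and the names `sample_item_subset`, `encode_items`, and `toy_encoder` are all assumptions made for illustration.

```python
import random


def sample_item_subset(num_items, required_ids, subset_size, rng):
    """Return `subset_size` distinct item ids that always include `required_ids`.

    Assumption: the remaining slots are filled uniformly at random.
    """
    required = set(required_ids)
    if len(required) > subset_size:
        raise ValueError("subset_size is smaller than the number of required items")
    # Candidates are all items not already forced into the subset.
    candidates = [i for i in range(num_items) if i not in required]
    extra = rng.sample(candidates, subset_size - len(required))
    return sorted(required | set(extra))


def encode_items(item_ids, item_texts, encoder):
    """Encode only the sampled items; `encoder` stands in for a language model."""
    return {i: encoder(item_texts[i]) for i in item_ids}


# Toy usage: the corpus has 1,000 items, but only 16 are encoded at this step.
rng = random.Random(0)
item_texts = [f"item {i}" for i in range(1000)]
toy_encoder = lambda text: [float(len(text))]  # hypothetical placeholder encoder
subset = sample_item_subset(len(item_texts), required_ids=[3, 42], subset_size=16, rng=rng)
embeddings = encode_items(subset, item_texts, toy_encoder)
```

Under these assumptions, the per-step encoding cost scales with `subset_size` rather than with the size of the item corpus, which is the saving the abstract attributes to this component.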