Online Distillation-enhanced Multi-modal Transformer for Sequential Recommendation

Multi-modal recommendation systems, which integrate diverse types of information, have gained widespread attention in recent years. However, compared to traditional collaborative filtering-based multi-modal recommendation systems, research on multi-modal sequential recommendation remains at an early stage. Unlike traditional sequential recommendation models, which rely solely on item identifier (ID) information and focus on network structure design, multi-modal recommendation models need to emphasize item representation learning and the fusion of heterogeneous data sources. This paper investigates the impact of item representation learning on downstream recommendation tasks and examines how information fusion differs across stages. Empirical experiments demonstrate the need for a framework suited to the collaborative learning and fusion of diverse information. Based on this, we propose a new model-agnostic framework for multi-modal sequential recommendation, called Online Distillation-enhanced Multi-modal Transformer (ODMT), which enhances feature interaction and mutual learning among multi-source inputs (ID, text, and image) while avoiding conflicts among different features during training, thereby improving recommendation accuracy. Specifically, we first introduce an ID-aware Multi-modal Transformer module in the item representation learning stage to facilitate information interaction among different features. Second, we employ an online distillation training strategy in the prediction optimization stage so that the multi-source data learn from each other and prediction robustness improves. Experimental results on a video content recommendation dataset and three e-commerce recommendation datasets demonstrate the effectiveness of the two proposed modules, which yield an improvement of approximately 10% in performance over baseline models.
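
To make the online distillation idea concrete, the following is a minimal sketch of a mutual-learning loss over three prediction branches (ID, text, and image). It is an illustration under stated assumptions, not the loss used in ODMT: the abstract does not specify the formulation, so the peer-averaged teacher, the temperature, the weight `alpha`, and all function names here are hypothetical choices.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the ground-truth next item."""
    return -math.log(softmax(logits)[target_index])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def online_distillation_loss(logits_by_source, target_index,
                             temperature=2.0, alpha=0.5):
    """Per-branch loss: supervised term plus a term pulling each branch
    toward the average softened prediction of the other branches.

    Assumption: the teacher for a branch is the mean of its peers'
    distributions. In a real training loop the teacher would be treated
    as a constant (no gradient flows through it).
    """
    soft = {name: softmax(scores, temperature)
            for name, scores in logits_by_source.items()}
    losses = {}
    for name, scores in logits_by_source.items():
        peers = [soft[other] for other in soft if other != name]
        teacher = [sum(column) / len(peers) for column in zip(*peers)]
        supervised = cross_entropy(scores, target_index)
        distill = kl_divergence(teacher, soft[name])
        losses[name] = supervised + alpha * distill
    return losses


# Hypothetical scores over three candidate items from each branch.
example_logits = {
    "id": [2.0, 0.5, -1.0],
    "text": [1.5, 1.0, -0.5],
    "image": [0.2, 1.8, -0.3],
}
example_losses = online_distillation_loss(example_logits, target_index=0)
```

In this sketch every branch acts as both student and teacher within the same training step, which is what distinguishes online distillation from the usual setting with a fixed, pre-trained teacher. When all branches agree exactly, the distillation term vanishes and each loss reduces to plain cross-entropy.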
