Online Distillation-enhanced Multi-modal Transformer for Sequential Recommendation

Multi-modal recommendation systems, which integrate diverse types of information, have gained widespread attention in recent years. However, compared to traditional collaborative filtering-based multi-modal recommendation systems, research on multi-modal sequential recommendation is still in its nascent stages. Unlike traditional sequential recommendation models, which rely solely on item identifier (ID) information and focus on network structure design, multi-modal recommendation models need to emphasize item representation learning and the fusion of heterogeneous data sources. This paper investigates the impact of item representation learning on downstream recommendation tasks and examines how information fusion differs across stages. Empirical experiments demonstrate the need for a framework suited to the collaborative learning and fusion of diverse information. Based on this, we propose a new model-agnostic framework for multi-modal sequential recommendation tasks, called Online Distillation-enhanced Multi-modal Transformer (ODMT), which enhances feature interaction and mutual learning among multi-source inputs (ID, text, and image) while avoiding conflicts among different features during training, thereby improving recommendation accuracy. Specifically, we first introduce an ID-aware Multi-modal Transformer module in the item representation learning stage to facilitate information interaction among different features. Second, we employ an online distillation training strategy in the prediction optimization stage so that the multi-source data learn from each other, improving prediction robustness. Experimental results on a streaming media recommendation dataset and three e-commerce recommendation datasets demonstrate the effectiveness of the two proposed modules, with an improvement of approximately 10% in performance over baseline models.
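
The abstract does not spell out the loss used by the online distillation strategy, so the following is only a minimal sketch of a generic online (mutual) distillation objective, not the authors' exact formulation. It assumes three prediction heads (ID, text, image) that each score the same candidate items, and it uses the averaged softened prediction as the shared teacher. The function names, the `temperature` and `alpha` parameters, and the choice of an averaged teacher are all illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-likelihood of the ground-truth item index."""
    return -math.log(probs[target])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def online_distillation_loss(logits_by_source, target, temperature=2.0, alpha=0.5):
    """Hypothetical mutual-distillation loss over several prediction heads.

    logits_by_source: dict mapping a source name (e.g. "id", "text", "image")
    to that head's scores over the candidate items.
    """
    names = list(logits_by_source)
    # Softened predictions of every head.
    soft = {n: softmax(logits_by_source[n], temperature) for n in names}
    num_items = len(soft[names[0]])
    # Assumption: the teacher is the plain average of all softened predictions.
    # In real training it would be treated as a constant (no gradient).
    teacher = [sum(soft[n][i] for n in names) / len(names) for i in range(num_items)]

    total = 0.0
    for n in names:
        hard = cross_entropy(softmax(logits_by_source[n]), target)  # supervised term
        distill = kl_divergence(teacher, soft[n])                   # peer-learning term
        total += (1 - alpha) * hard + alpha * distill
    return total / len(names)


# Toy usage: three heads scoring three candidate items, item 0 is the true next item.
example_logits = {
    "id": [2.0, 0.5, -1.0],
    "text": [1.5, 1.0, -0.5],
    "image": [0.2, 0.1, 0.0],
}
example_loss = online_distillation_loss(example_logits, target=0)
```

In this sketch each head keeps its own supervised term, and the distillation term only pulls it toward the consensus of all heads. This is one plausible way to let the sources learn from each other without merging them into a single prediction.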
