Large Multi-modal Encoders for Recommendation

In recent years, the rapid growth of online multimedia services, such as e-commerce platforms, has necessitated the development of personalised recommendation approaches that can encode diverse content about each item. Indeed, modern multi-modal recommender systems exploit diverse features obtained from raw images and item descriptions to enhance recommendation performance. However, existing multi-modal recommenders primarily depend on features extracted separately from each medium through pre-trained modality-specific encoders, and exhibit only shallow alignment between modalities, which limits their ability to capture the underlying relationships between the modalities. In this paper, we investigate the use of large multi-modal encoders within the specific context of recommender systems, as these encoders have previously demonstrated state-of-the-art effectiveness when ranking items across various domains. Specifically, we tailor two state-of-the-art multi-modal encoders (CLIP and VLMo) for recommendation tasks using a range of strategies: we explore both pre-trained and fine-tuned encoders, and we assess the end-to-end training of these encoders. We demonstrate that pre-trained large multi-modal encoders generate more aligned and more effective user/item representations than existing modality-specific encoders across three multi-modal recommendation datasets. Furthermore, we show that fine-tuning these large multi-modal encoders on recommendation datasets further improves recommendation performance. Regarding training paradigms, our experiments highlight the essential role of end-to-end training of large multi-modal encoders in multi-modal recommendation systems.
