Generative recommendation (GR) has become a powerful paradigm in recommendation systems: it implicitly ties modality and semantics to item representations, in contrast to earlier autoregressive methods that relied on non-semantic item identifiers. However, prior work has largely treated modalities in isolation, typically assuming that item content is unimodal (usually text). We argue that this is a significant limitation, given the rich, multimodal nature of real-world data and the potential sensitivity of GR models to which modalities are chosen and how they are used. We study the problem of Multimodal Generative Recommendation (MGR) and highlight the importance of modality choices in GR frameworks. We show that GR models are particularly sensitive to the choice of modality and examine the challenges of achieving effective GR when multiple modalities are available. By evaluating design strategies for leveraging multiple modalities, we identify key challenges and introduce MGR-LF++, an enhanced late fusion framework that employs contrastive modality alignment and special tokens to denote different modalities, achieving a performance improvement of over 20% compared to single-modality alternatives.
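To make the two ingredients named above concrete, the following is a minimal sketch, not the MGR-LF++ implementation. It assumes that "contrastive modality alignment" can be illustrated with a symmetric InfoNCE-style loss over paired text and image embeddings of the same items, and that "special tokens" can be illustrated by prefixing each modality's semantic IDs with a marker token. The function names, the `<text>`/`<image>` token strings, and the temperature value are all hypothetical and do not come from the paper.

```python
import math

# Hypothetical special tokens marking which modality the following IDs belong to.
TEXT_TOKEN = "<text>"
IMAGE_TOKEN = "<image>"


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(text_embs, image_embs, temperature=0.1):
    """Symmetric InfoNCE loss: row i of each input describes the same item.

    Assumption: this stands in for the paper's alignment objective, whose
    exact form is not given in the abstract.
    """
    t = [l2_normalize(v) for v in text_embs]
    im = [l2_normalize(v) for v in image_embs]
    logits = [[dot(a, b) / temperature for b in im] for a in t]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_transposed)
    )


def build_item_tokens(text_ids, image_ids):
    """Concatenate per-modality semantic IDs, each prefixed by its marker token."""
    return [TEXT_TOKEN, *text_ids, IMAGE_TOKEN, *image_ids]


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]
    image_aligned = [[1.0, 0.0], [0.0, 1.0]]
    image_shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", info_nce(text, image_aligned))
    print("shuffled loss:", info_nce(text, image_shuffled))
    print(build_item_tokens([3, 7], [5, 1]))
```

In this toy setup the loss is small when each item's text and image embeddings point in the same direction and large when the pairs are shuffled, which is the behavior an alignment objective is meant to reward. The marker tokens let a single autoregressive model tell apart ID sequences that were produced separately per modality, as in a late fusion design.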