Multimodal foundation models (MFMs) have revolutionized sequential recommender systems through advanced representation learning. While parameter-efficient fine-tuning (PEFT) is commonly used to adapt these models, existing studies often prioritize parameter efficiency while neglecting GPU memory usage and training speed. To address this, we introduced the IISAN framework, which significantly improves efficiency. However, IISAN was limited to symmetrical MFMs with identical text and image encoders, preventing the use of state-of-the-art Large Language Models. To overcome this limitation, we develop IISAN-Versa, a versatile plug-and-play architecture compatible with both symmetrical and asymmetrical MFMs. IISAN-Versa employs a decoupled PEFT structure and utilizes both intra- and inter-modal adaptation. It handles asymmetry through a simple yet effective combination of group layer-dropping and dimension transformation alignment. Our experiments demonstrate that IISAN-Versa effectively adapts large text encoders, and we further identify a scaling effect in which larger encoders generally perform better. IISAN-Versa also demonstrates strong versatility in our defined multimodal scenarios, which include raw titles and captions generated from images and videos. Additionally, IISAN-Versa achieves state-of-the-art performance on the MicroLens public benchmark. We will release our code and datasets to support future research.
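To make the asymmetry-handling idea concrete, the combination of group layer-dropping and dimension transformation alignment can be sketched as below. This is an illustrative sketch only, not the authors' implementation: the function names, the grouping strategy (keep the last layer of each uniform group), and the encoder sizes (a 24-layer, 1024-dimensional text encoder paired with a 12-layer, 768-dimensional image encoder) are all assumptions for the example.

```python
import numpy as np

def group_layer_drop(hidden_states, num_groups):
    """Uniformly partition the layer-wise hidden states into groups and keep
    the last layer of each group, so a deeper tower exposes the same number
    of adaptation taps as a shallower one. (Illustrative strategy.)"""
    n = len(hidden_states)
    step = n // num_groups
    return [hidden_states[g * step + step - 1] for g in range(num_groups)]

def align_dimensions(h, weight):
    """Linear projection mapping one tower's hidden size onto the other's
    (hypothetical stand-in for dimension transformation alignment)."""
    return h @ weight

# Assumed asymmetric pair: 24 text layers at width 1024 vs. 12 image layers
# at width 768; batch of 4 pooled token representations per layer.
rng = np.random.default_rng(0)
text_states = [rng.normal(size=(4, 1024)) for _ in range(24)]

kept = group_layer_drop(text_states, num_groups=12)  # 24 layers -> 12 taps
W = rng.normal(size=(1024, 768))                     # width 1024 -> 768
aligned = [align_dimensions(h, W) for h in kept]     # now matches image tower
```

After these two steps, each retained text-encoder layer has a counterpart of matching depth and width in the image tower, so the intra- and inter-modal adaptation blocks can be attached exactly as in the symmetrical case.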