Computer Vision (CV), Natural Language Processing (NLP), and Recommender Systems (RecSys) are three prominent AI applications that have traditionally developed independently, resulting in disparate modeling and engineering methodologies. This separation has made it difficult for the fields to benefit directly from one another's advances. With the recent development of foundation models, large language models have emerged as a potential general-purpose interface for unifying different modalities and problem formulations. In light of this, we propose a multimodal foundation model (MFM) that considers visual, textual, and personalization modalities under the P5 recommendation paradigm, named VIP5 (Visual P5), to unify various modalities and recommendation tasks. This enables multiple modalities to be processed in a shared architecture for improved recommendations. To achieve this, we introduce multimodal personalized prompts that accommodate multiple modalities under a shared format. We also propose a parameter-efficient training method for foundation models that freezes the P5 backbone and fine-tunes lightweight adapters, improving recommendation performance while reducing training time and memory usage. Code and data of VIP5 are available at https://github.com/jeykigung/VIP5.
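To make the parameter-efficient training idea concrete, the following is a minimal plain-Python sketch of "freeze the backbone, train only an adapter". It is a toy under stated assumptions, not the VIP5 implementation: the names `FrozenBackbone`, `ResidualAdapter`, `train_adapter`, and `TARGET_MAP` are hypothetical, the backbone is a fixed 2x2 linear map standing in for a pretrained model, and the adapter is a simplified residual linear layer rather than the adapter design used in the paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class FrozenBackbone:
    """Stand-in for a pretrained backbone; its weights are never updated."""

    def __init__(self, weights):
        self.weights = [row[:] for row in weights]

    def forward(self, x):
        return matvec(self.weights, x)


class ResidualAdapter:
    """Small trainable module: out = h + A h, with A initialised to zero
    so that training starts from the unmodified backbone output."""

    def __init__(self, dim):
        self.A = [[0.0] * dim for _ in range(dim)]

    def forward(self, h):
        delta = matvec(self.A, h)
        return [hi + di for hi, di in zip(h, delta)]


def train_adapter(backbone, adapter, data, lr=0.05, epochs=200):
    """Per-sample gradient descent on squared error; only adapter.A changes."""
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            h = backbone.forward(x)          # frozen features, no update here
            out = adapter.forward(h)
            err = [o - t for o, t in zip(out, y)]
            total += sum(e * e for e in err)
            # d(sum of squared errors) / dA[i][j] = 2 * err[i] * h[j]
            for i in range(len(err)):
                for j in range(len(h)):
                    adapter.A[i][j] -= lr * 2.0 * err[i] * h[j]
        losses.append(total / len(data))
    return losses


# Hypothetical toy task: the target is a slightly different map of the features.
TARGET_MAP = [[1.2, 0.0], [0.1, 0.9]]
backbone = FrozenBackbone([[1.0, 0.5], [0.0, 1.0]])
adapter = ResidualAdapter(dim=2)
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
data = [(x, matvec(TARGET_MAP, backbone.forward(x))) for x in inputs]

losses = train_adapter(backbone, adapter, data)
print("first epoch loss:", losses[0])
print("last epoch loss:", losses[-1])
```

Because the update loop only ever writes to `adapter.A`, no gradients need to be computed or stored for the backbone weights. In this toy setting that is the source of the time and memory savings; in a real model the adapter would also be much smaller than the backbone it is attached to.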