Computer Vision (CV), Natural Language Processing (NLP), and Recommender Systems (RecSys) are three prominent AI applications that have traditionally developed independently, resulting in disparate modeling and engineering methodologies. This separation has made it hard for the fields to benefit directly from each other's advances. As multimodal data becomes increasingly available on the web, there is a growing need to consider multiple modalities when making recommendations for users. With the recent rise of foundation models, large language models have emerged as a potential general-purpose interface for unifying different modalities and problem formulations. In light of this, we propose a multimodal foundation model that considers both visual and textual modalities under the P5 recommendation paradigm (VIP5), unifying various modalities and recommendation tasks. This enables vision, language, and personalization information to be processed in a shared architecture for improved recommendations. To achieve this, we introduce multimodal personalized prompts that accommodate multiple modalities under a shared format. Additionally, we propose a parameter-efficient training method for foundation models that freezes the backbone and fine-tunes lightweight adapters, resulting in improved recommendation performance and greater efficiency in training time and memory usage.
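The parameter-efficient training idea mentioned in the abstract (a frozen backbone with lightweight trainable adapters) can be illustrated with a minimal sketch. This is not the VIP5 implementation: the class names `FrozenLayer` and `Adapter`, the dimensions, the bottleneck design with a residual connection, and the zero initialization of the up-projection are all assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
import random


def linear(x, weight, bias):
    """Compute W x + b for one vector x; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


class FrozenLayer:
    """Hypothetical stand-in for one pretrained backbone layer (never updated)."""
    trainable = False

    def __init__(self, d_model, rng):
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                       for _ in range(d_model)]
        self.bias = [0.0] * d_model

    def forward(self, x):
        return linear(x, self.weight, self.bias)

    def num_params(self):
        return len(self.weight) * len(self.weight[0]) + len(self.bias)


class Adapter:
    """Hypothetical bottleneck adapter: x + up(relu(down(x)))."""
    trainable = True

    def __init__(self, d_model, d_bottleneck, rng):
        self.down_w = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                       for _ in range(d_bottleneck)]
        self.down_b = [0.0] * d_bottleneck
        # Assumption: zero-initialized up-projection, so the adapter starts
        # as an identity mapping and does not disturb the frozen backbone.
        self.up_w = [[0.0] * d_bottleneck for _ in range(d_model)]
        self.up_b = [0.0] * d_model

    def forward(self, x):
        h = relu(linear(x, self.down_w, self.down_b))
        delta = linear(h, self.up_w, self.up_b)
        return [xi + di for xi, di in zip(x, delta)]

    def num_params(self):
        return (len(self.down_w) * len(self.down_w[0]) + len(self.down_b)
                + len(self.up_w) * len(self.up_w[0]) + len(self.up_b))


def count_params(modules):
    """Return (trainable, total) parameter counts over a list of modules."""
    trainable = sum(m.num_params() for m in modules if m.trainable)
    total = sum(m.num_params() for m in modules)
    return trainable, total


rng = random.Random(0)
D_MODEL, D_BOTTLENECK = 64, 4  # illustrative sizes, not taken from the paper

backbone = FrozenLayer(D_MODEL, rng)
adapter = Adapter(D_MODEL, D_BOTTLENECK, rng)

x = [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
hidden = backbone.forward(x)   # frozen computation
out = adapter.forward(hidden)  # only these parameters would receive updates

trainable, total = count_params([backbone, adapter])
print(f"trainable {trainable} of {total} parameters")
```

In this toy setting only the adapter's parameters would be handed to an optimizer, which is where the savings in training time and memory would come from; the actual ratio depends on the real backbone and adapter sizes.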