The development of large vision-language models (LVLMs) offers the potential to address challenges faced by traditional multimodal recommendation, thanks to their proficient understanding of static images and textual dynamics. However, the application of LVLMs in this field remains limited by two complexities. First, LVLMs lack user preference knowledge because they are trained on vast general-purpose datasets. Second, LVLMs struggle with multiple-image dynamics in scenarios involving discrete, noisy, and redundant image sequences. To overcome these issues, we propose Rec-GPT4V: Visual-Summary Thought (VST), a novel reasoning scheme for leveraging large vision-language models in multimodal recommendation. To address the first challenge, we use the user's history as in-context user preferences. Next, we prompt LVLMs to generate item image summaries and then use this image comprehension, expressed in natural language space and combined with item titles, to query user preferences over candidate items. We conduct comprehensive experiments across four datasets with three LVLMs: GPT4-V, LLaVa-7b, and LLaVa-13b. The numerical results indicate the efficacy of VST.
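The two prompting steps described in the abstract (summarize each item image, then query preferences in natural language) can be illustrated with a minimal sketch. It only builds prompt strings and makes no model call. The `Item` class, the function names, and all prompt wording are assumptions made for illustration; they are not the templates used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """Hypothetical container for one item in the history or candidate set."""
    title: str
    image_summary: str  # text an LVLM is assumed to have produced for the item image


def summary_prompt(title: str) -> str:
    """Step 1 (assumed wording): prompt sent to the LVLM along with one item image."""
    return (
        f"Describe the attached product image for the item titled '{title}' "
        "in one or two sentences, focusing on visual attributes."
    )


def describe(items: List[Item]) -> str:
    """Render items as numbered lines of title plus image summary."""
    return "\n".join(
        f"{i}. {item.title}: {item.image_summary}"
        for i, item in enumerate(items, start=1)
    )


def ranking_prompt(history: List[Item], candidates: List[Item]) -> str:
    """Step 2 (assumed wording): text-only prompt with user history as in-context preferences."""
    return (
        "The user has interacted with these items, in order:\n"
        f"{describe(history)}\n\n"
        "Candidate items:\n"
        f"{describe(candidates)}\n\n"
        "Rank the candidate items by how likely the user is to prefer them."
    )
```

In this sketch the image content reaches the ranking step only as text, so the final query contains no images, which reflects the abstract's idea of working in natural language space rather than over a raw image sequence.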