Large Multimodal Models (LMMs) have demonstrated impressive performance across a range of vision and language tasks, yet their potential for visually assisted recommendation remains unexplored. To bridge this gap, we present a preliminary case study investigating the recommendation capabilities of GPT-4V(ision), a recently released LMM from OpenAI. We construct a series of qualitative test samples spanning multiple domains and use them to assess the quality of GPT-4V's responses in recommendation scenarios. Results on these samples suggest that GPT-4V has remarkable zero-shot recommendation abilities across diverse domains, owing to its robust visual-text comprehension and extensive general knowledge. However, we also identify limitations in using GPT-4V for recommendation, including a tendency to give similar responses to similar inputs. The report concludes with an in-depth discussion of the challenges and research opportunities associated with using GPT-4V in recommendation scenarios. Our objective is to explore the potential of extending LMMs from vision and language tasks to recommendation tasks, and we hope to inspire further research into next-generation multimodal generative recommendation models that enhance user experience through greater diversity and interactivity. All images and prompts used in this report will be accessible at https://github.com/PALIN2018/Evaluate_GPT-4V_Rec.