Large Language Models (LLMs) have recently shown strong potential for sequential recommendation through text-only models that combine advanced prompt design, contrastive alignment, and fine-tuning on downstream domain-specific data. While effective, these approaches overlook the rich visual information present in many real-world recommendation scenarios, particularly in e-commerce. This paper proposes PixRec, a vision-language framework that incorporates both textual attributes and product images into the recommendation pipeline. Our architecture leverages a vision-language model backbone capable of jointly processing image-text sequences, maintaining a dual-tower structure and a mixed training objective while aligning multi-modal feature projections for both item-item and user-item interactions. On the Amazon Reviews dataset augmented with product images, our experiments demonstrate a $3\times$ improvement in top-1 accuracy and a 40% improvement in top-10 accuracy over text-only recommenders, indicating that visual features help distinguish items with similar textual descriptions. We outline future directions for scaling multi-modal recommender training, enhancing visual-text feature fusion, and evaluating inference-time performance. This work takes a step toward software systems that exploit visual information in sequential recommendation for real-world applications such as e-commerce.
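The abstract does not specify the exact form of the alignment objective. As a minimal sketch of how a dual-tower recommender might align user and item feature projections, the function below computes a symmetric InfoNCE-style contrastive loss over a batch of matched user-item embedding pairs; the function name, temperature value, and NumPy formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def info_nce_loss(user_emb, item_emb, temperature=0.07):
    """Symmetric InfoNCE loss between two embedding towers.

    user_emb, item_emb: (batch, dim) arrays; row i of each tower is a
    positive (matched) pair, and all other rows serve as in-batch negatives.
    Temperature 0.07 is a common default, assumed here for illustration.
    """
    # L2-normalize so the dot product is cosine similarity
    u = user_emb / np.linalg.norm(user_emb, axis=1, keepdims=True)
    v = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    logits = u @ v.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(u))              # positives lie on the diagonal

    def xent(l):
        # numerically stable cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average of user->item and item->user directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned towers (identical embeddings) the loss approaches zero, while unrelated embeddings yield a loss near `log(batch_size)`, which is what makes this objective useful for pulling matched image-text and user-item projections together.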