Situated conversational recommendation (SCR), which uses visual scenes grounded in specific environments together with natural language dialogue to deliver contextually appropriate recommendations, has emerged as a promising research direction because it aligns closely with real-world scenarios. Compared with traditional recommendation, SCR requires a deeper understanding of dynamic and implicit user preferences: the surrounding scene often influences users' underlying interests, and both the scene and those interests may evolve over the course of a conversation. This complexity directly affects the timing and relevance of recommendations. To address this, we propose situated preference reasoning (SiPeR), a novel framework that integrates two core mechanisms: (1) scene transition estimation, which estimates whether the current scene satisfies the user's needs and guides the user toward a more suitable scene when necessary; and (2) Bayesian inverse inference, which uses likelihoods computed by multimodal large language models (MLLMs) to predict user preferences over candidate items within the scene. Extensive experiments on two representative benchmarks demonstrate SiPeR's superiority in both recommendation accuracy and response generation quality. The code and data are available at https://github.com/DongdingLin/SiPeR.
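To make the Bayesian inverse inference idea concrete, the following is a minimal sketch and not the SiPeR implementation. It assumes that an MLLM has already assigned each candidate item a log-likelihood of the user's utterance given that item and the scene, and it applies Bayes' rule to turn those scores into a posterior preference over items. The function name `posterior_over_items`, the item names, the numeric scores, and the uniform prior are all hypothetical choices made for illustration.

```python
import math


def posterior_over_items(log_likelihoods, log_priors=None):
    """Apply Bayes' rule in log space.

    log_likelihoods: dict mapping item -> log P(utterance | item, scene),
        assumed to come from an MLLM (here they are just numbers).
    log_priors: dict mapping item -> log P(item | scene); a uniform prior
        is assumed when omitted (an illustrative choice, not from the paper).
    Returns a dict mapping item -> P(item | utterance, scene).
    """
    items = list(log_likelihoods)
    if log_priors is None:
        log_priors = {item: 0.0 for item in items}  # uniform up to a constant

    # Unnormalized log posterior: log prior + log likelihood.
    scores = {item: log_likelihoods[item] + log_priors[item] for item in items}

    # Normalize with the log-sum-exp trick for numerical stability.
    max_score = max(scores.values())
    log_norm = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores.values())
    )
    return {item: math.exp(s - log_norm) for item, s in scores.items()}


# Hypothetical MLLM log-likelihoods for three candidate items in a scene.
example_log_likelihoods = {
    "red_jacket": -2.0,
    "blue_jeans": -4.0,
    "grey_hoodie": -3.0,
}

posterior = posterior_over_items(example_log_likelihoods)
ranked = sorted(posterior, key=posterior.get, reverse=True)
print(ranked)
```

Under these assumptions, the item whose presence best explains the user's utterance receives the highest posterior and is ranked first. How the likelihoods are actually obtained from the MLLM, and how the prior is defined, are details of the released code rather than of this sketch.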