Adaptive Vague Preference Policy Learning for Multi-round Conversational Recommendation

Conversational recommendation systems (CRS) effectively address information asymmetry by dynamically eliciting user preferences through multi-turn interactions. Existing CRS methods widely assume that users have clear preferences. Under this assumption, the agent fully trusts user feedback and treats accepted or rejected signals as strong indicators for filtering items and shrinking the candidate space, which can lead to over-filtering. In reality, however, users' preferences are often vague and volatile: users may be uncertain about what they want and may change their decisions during the interaction. To address this issue, we introduce a novel scenario called Vague Preference Multi-round Conversational Recommendation (VPMCR), which accounts for users' vague and volatile preferences in CRS. VPMCR employs a soft estimation mechanism that assigns a non-zero confidence score to every candidate item to be displayed, naturally avoiding the over-filtering problem. In the VPMCR setting, we propose a solution called Adaptive Vague Preference Policy Learning (AVPPL), which consists of two main components: Uncertainty-aware Soft Estimation (USE) and Uncertainty-aware Policy Learning (UPL). USE estimates the uncertainty of users' vague feedback and captures their dynamic preferences using a choice-based preference extraction module and a time-aware decaying strategy. UPL leverages the preference distribution estimated by USE to guide the conversation, deciding whether to recommend items or ask about attributes while adapting to changes in users' preferences. Extensive experiments demonstrate the effectiveness of our method in the VPMCR scenario, highlighting its potential to improve the performance and applicability of CRS in real-world settings, particularly for users with vague or dynamic preferences.
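To make the contrast between hard filtering and soft estimation concrete, here is a minimal Python sketch under stated assumptions. It is not the USE module described above: the function names `hard_filter` and `soft_confidence`, the decay factor `gamma`, the additive attribute-match score, and the softmax normalization are all hypothetical choices made for illustration. It only shows how a clear-preference filter can empty the candidate set when a user changes their mind, while a time-decayed soft score keeps every item at a non-zero confidence.

```python
import math
from typing import Dict, List, Set, Tuple

# Hypothetical toy data: each item is described by a set of attribute names.
Items = Dict[str, Set[str]]
# One turn of feedback (hypothetical format): (accepted attributes, rejected attributes).
Turn = Tuple[Set[str], Set[str]]


def hard_filter(items: Items, turns: List[Turn]) -> List[str]:
    """Clear-preference baseline: drop any item that conflicts with any turn's feedback."""
    kept = []
    for name, attrs in items.items():
        # Keep the item only if it has every accepted attribute and no rejected one, in all turns.
        if all(acc <= attrs and not (rej & attrs) for acc, rej in turns):
            kept.append(name)
    return kept


def soft_confidence(items: Items, turns: List[Turn], gamma: float = 0.8) -> Dict[str, float]:
    """Soft alternative (hypothetical): time-decayed evidence normalized with a softmax.

    Feedback from turn t (0-based) out of T turns is weighted by gamma ** (T - 1 - t),
    so recent answers count more than older ones. Every item keeps a non-zero confidence.
    """
    T = len(turns)
    logits = {}
    for name, attrs in items.items():
        score = 0.0
        for t, (acc, rej) in enumerate(turns):
            weight = gamma ** (T - 1 - t)
            # +1 per accepted attribute the item has, -1 per rejected attribute it has.
            score += weight * (len(acc & attrs) - len(rej & attrs))
        logits[name] = score
    # Numerically stable softmax over all candidate items.
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


# A vague user: rejects "cotton" in turn 0, then accepts "cotton" in turn 1.
items: Items = {"a": {"red", "cotton"}, "b": {"blue", "cotton"}, "c": {"red", "silk"}}
turns: List[Turn] = [({"red"}, {"cotton"}), ({"cotton"}, set())]

print(hard_filter(items, turns))  # → []
conf = soft_confidence(items, turns)
print(sorted(conf, key=conf.get, reverse=True))  # → ['a', 'c', 'b']
```

With the toy feedback above, the hard filter removes every candidate because the two turns contradict each other, whereas the soft scores still rank all three items and favor the one that matches the most recent answer. The additive score and softmax are simply the easiest way to obtain a full distribution with no zeros; any method that keeps confidences strictly positive would serve the same illustrative purpose.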

翻译：对话式推荐系统（CRS）通过多轮交互动态获取用户偏好，有效解决了信息不对称问题。现有CRS普遍假设用户具有明确偏好。在此假设下，系统将完全信任用户反馈，并将接受或拒绝信号视为强指标以过滤物品并缩减候选空间，这可能导致过度过滤问题。然而，现实中用户的偏好往往模糊且易变，其在交互过程中存在对需求的不确定性以及决策的变化。为解决该问题，我们引入一种名为"模糊偏好多轮对话式推荐"（VPMCR）的新场景，该场景在CRS中考虑了用户模糊且易变的偏好。VPMCR采用软估计机制为所有待展示候选物品分配非零置信度分数，从而自然避免过度过滤问题。在VPMCR设置下，我们提出一种名为"自适应模糊偏好策略学习"（AVPPL）的解决方案，该方案包含两大核心组件：不确定性感知软估计（USE）和不确定性感知策略学习（UPL）。USE通过基于选择的偏好提取模块与时序衰减策略，估计用户模糊反馈的不确定性并捕获其动态偏好；UPL则利用USE估计的偏好分布引导对话过程，通过适应用户偏好变化来执行推荐或属性询问。大量实验证明了我们方法在VPMCR场景中的有效性，突显了其在现实环境中（尤其针对具有模糊或动态偏好的用户）的实际应用潜力，以及提升CRS整体性能与适用性的价值。