Embracing Uncertainty: Adaptive Vague Preference Policy Learning for Multi-round Conversational Recommendation

Conversational recommendation systems (CRS) address information asymmetry by dynamically eliciting user preferences through multi-turn interactions. Existing CRS approaches widely assume that users have clear preferences. Under this assumption, the agent fully trusts user feedback and treats accept or reject signals as strong indicators for filtering items and shrinking the candidate space, which can lead to over-filtering. In reality, however, users' preferences are often vague and volatile: users are uncertain about what they want and change their decisions during the interaction. To address this issue, we introduce a novel scenario called Vague Preference Multi-round Conversational Recommendation (VPMCR), which considers users' vague and volatile preferences in CRS. VPMCR employs a soft estimation mechanism that assigns a non-zero confidence score to all candidate items to be displayed, naturally avoiding the over-filtering problem. In the VPMCR setting, we introduce a solution called Adaptive Vague Preference Policy Learning (AVPPL), which consists of two main components: Uncertainty-aware Soft Estimation (USE) and Uncertainty-aware Policy Learning (UPL). USE estimates the uncertainty of users' vague feedback and captures their dynamic preferences using a choice-based preference extraction module and a time-aware decaying strategy. UPL leverages the preference distribution estimated by USE to guide the conversation and adapts to changes in users' preferences when deciding whether to recommend items or ask about attributes. Extensive experiments demonstrate the effectiveness of our method in the VPMCR scenario and highlight its potential to improve the performance and applicability of CRS in real-world settings, particularly for users with vague or dynamic preferences.
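To make the contrast with hard filtering concrete, the following is a minimal illustrative sketch of soft estimation with time-aware decay. It is not the AVPPL implementation: the function name `soft_item_confidence`, the exponential decay factor `decay`, the attribute-overlap scoring, and the softmax normalization are all assumptions chosen for illustration. The sketch only shows the two properties named above: older feedback counts for less, and every candidate item keeps a non-zero confidence score.

```python
import math


def soft_item_confidence(item_attrs, feedback, decay=0.8):
    """Hypothetical soft estimation of item confidence.

    item_attrs: dict mapping item id -> set of attribute names (non-empty dict).
    feedback:   list of (accepted, rejected) attribute sets, oldest turn first.
    decay:      assumed per-turn decay factor in (0, 1]; older turns weigh less.
    """
    last_turn = len(feedback) - 1
    scores = {item: 0.0 for item in item_attrs}

    for turn, (accepted, rejected) in enumerate(feedback):
        # Time-aware decay: the most recent turn has weight 1.
        weight = decay ** (last_turn - turn)
        for item, attrs in item_attrs.items():
            # Accepted attributes raise the score, rejected ones lower it,
            # but a rejection never removes the item from the candidate set.
            scores[item] += weight * (len(attrs & accepted) - len(attrs & rejected))

    # Softmax keeps every confidence strictly positive and sums to 1.
    max_score = max(scores.values())
    exp_scores = {item: math.exp(s - max_score) for item, s in scores.items()}
    total = sum(exp_scores.values())
    return {item: e / total for item, e in exp_scores.items()}


# Toy catalogue and a user who first accepts "jazz", then changes their mind.
items = {"A": {"rock", "90s"}, "B": {"jazz"}, "C": {"rock"}}
turns = [({"jazz"}, set()), ({"rock"}, {"jazz"})]
confidence = soft_item_confidence(items, turns)
```

In this toy run, item `B` is ranked below `A` and `C` after the user rejects "jazz", yet it keeps a positive confidence, so a later change of mind can still bring it back. A hard filter would have dropped it from the candidate set at the second turn.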
