Conversational recommender systems (CRSs) operate under incremental preference revelation, requiring recommendation decisions under uncertainty. While recent LLM-based approaches achieve strong performance on proxy metrics such as Recall@K and BLEU, they often fail to deliver high-quality, user-aligned recommendations in practice, as they optimize intermediate objectives like retrieval accuracy or fluent generation rather than recommendation quality itself. We propose HARPO (Hierarchical Agentic Reasoning with Preference Optimization), an agentic framework that reframes conversational recommendation as a structured decision-making process optimized for multi-dimensional recommendation quality. HARPO integrates (i) hierarchical preference learning that decomposes recommendation quality into interpretable dimensions (relevance, diversity, satisfaction, and engagement) with context-dependent weighting; (ii) deliberative tree-search reasoning guided by a learned value network evaluating candidate paths on predicted quality; and (iii) domain-agnostic reasoning abstractions through Virtual Tool Operations and multi-agent refinement. We evaluate HARPO on ReDial, INSPIRED, and MUSE, demonstrating consistent improvements over strong baselines on recommendation-centric metrics while maintaining competitive response quality.
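To make component (i) concrete, the following is a minimal sketch of context-dependent weighting over the four quality dimensions named above. It is not HARPO's implementation: the softmax over a linear map, the two context features, the weight matrix `EXAMPLE_W`, and all function names are hypothetical choices made for illustration only.

```python
import math
from typing import Dict, List

# The four quality dimensions named in the abstract.
DIMENSIONS = ("relevance", "diversity", "satisfaction", "engagement")


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: non-negative outputs that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def context_weights(context: List[float],
                    weight_matrix: List[List[float]]) -> Dict[str, float]:
    """Map a context feature vector to one weight per dimension.

    Assumption: a linear map followed by softmax. The abstract only says
    the weighting is context-dependent, not how it is computed.
    """
    logits = [sum(w * c for w, c in zip(row, context)) for row in weight_matrix]
    return dict(zip(DIMENSIONS, softmax(logits)))


def aggregate_quality(scores: Dict[str, float],
                      weights: Dict[str, float]) -> float:
    """Weighted sum of per-dimension scores."""
    return sum(weights[d] * scores[d] for d in DIMENSIONS)


def pick_best(candidates: Dict[str, Dict[str, float]],
              weights: Dict[str, float]) -> str:
    """Return the candidate name with the highest aggregate quality."""
    return max(candidates, key=lambda name: aggregate_quality(candidates[name], weights))


# Hypothetical context features: [user_wants_precision, user_wants_novelty].
# One row per dimension, in the order of DIMENSIONS.
EXAMPLE_W = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0], [0.0, 0.0]]

# Made-up per-dimension scores for two candidate recommendations.
CANDIDATES = {
    "familiar_hit": {"relevance": 0.9, "diversity": 0.1,
                     "satisfaction": 0.5, "engagement": 0.5},
    "novel_pick": {"relevance": 0.1, "diversity": 0.9,
                   "satisfaction": 0.5, "engagement": 0.5},
}

if __name__ == "__main__":
    for ctx in ([1.0, 0.0], [0.0, 1.0]):
        print(ctx, pick_best(CANDIDATES, context_weights(ctx, EXAMPLE_W)))
```

In this sketch the same two candidates are ranked differently depending on the context vector, which is the behavior context-dependent weighting is meant to provide. A learned value network, as in component (ii), could use an aggregate score of this kind as its prediction target, though the abstract does not specify that.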