Conversational recommender systems (CRSs) integrate recommendation and dialogue tasks, making their evaluation uniquely challenging. Existing approaches primarily assess CRS performance by evaluating item recommendation and dialogue management separately with rule-based metrics. However, these methods fail to capture the real human experience and cannot support direct conclusions about a system's overall performance. As conversational recommender systems become increasingly vital in e-commerce, social media, and customer support, the ability to evaluate both recommendation accuracy and dialogue management quality with a single metric that authentically reflects user experience has become the principal challenge impeding progress in this field. In this work, we propose a user-centric evaluation framework for CRSs based on large language models (LLMs), named the Conversational Recommendation Evaluator (CoRE). CoRE consists of two main components. (1) LLM-As-Evaluator: we first comprehensively summarize 12 key factors influencing user experience in CRSs and directly leverage an LLM as an evaluator to assign a score to each factor. (2) Multi-Agent Debater: we then design a multi-agent debate framework with four distinct roles (common user, domain expert, linguist, and HCI expert) that discuss and synthesize the 12 evaluation factors into a unified overall performance score. We apply the proposed framework to evaluate four CRSs on two benchmark datasets. The experimental results show that CoRE aligns well with human evaluation on most of the 12 factors as well as on the overall assessment. In particular, CoRE's overall evaluation scores align significantly better with human feedback than existing rule-based metrics do.