Conversational recommender systems (CRS) involve both recommendation and dialogue tasks, which makes their evaluation a unique challenge. Although past user studies have analyzed various factors that may affect user satisfaction with CRS interactions, few evaluation metrics for CRS have been proposed. Recent studies have shown that LLMs can align with human preferences, and several LLM-based text quality evaluation measures have been introduced. However, the application of LLMs to CRS evaluation remains relatively limited. To address this research gap and advance the development of user-centric conversational recommender systems, this study proposes an automated LLM-based CRS evaluation framework, building upon existing research in human-computer interaction and psychology. The framework evaluates CRS along four dimensions: dialogue behavior, language expression, recommendation items, and response content. We use this framework to evaluate four different conversational recommender systems.
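To make the framework concrete, the sketch below shows one plausible way an LLM-as-judge pass over the four dimensions could be wired up. It is a minimal illustration, not the paper's published protocol: the `call_llm` callable, the `build_judge_prompt` helper, the exact prompt wording, and the 1-5 rating scale are all our assumptions for this example.

```python
# A minimal sketch of LLM-as-judge evaluation over the four CRS dimensions.
# `call_llm` is a hypothetical stand-in for any chat-completion API; the
# prompt wording and the 1-5 scale are assumptions made for illustration.
import json
from typing import Callable, Dict, List

DIMENSIONS = [
    "dialogue behavior",      # e.g., proactivity, clarification questions
    "language expression",    # e.g., fluency, naturalness, tone
    "recommendation items",   # e.g., relevance of the suggested items
    "response content",       # e.g., informativeness, grounding
]

def build_judge_prompt(conversation: List[Dict[str, str]]) -> str:
    """Format a CRS conversation transcript into one evaluation prompt."""
    transcript = "\n".join(f"{t['role']}: {t['text']}" for t in conversation)
    criteria = "\n".join(f"- {d}" for d in DIMENSIONS)
    return (
        "You are evaluating a conversational recommender system.\n"
        "Rate the SYSTEM's turns on each dimension below from 1 (poor) "
        "to 5 (excellent). Reply with a JSON object mapping each "
        f"dimension name to an integer score.\n\nDimensions:\n{criteria}\n\n"
        f"Conversation:\n{transcript}"
    )

def evaluate_crs(
    conversation: List[Dict[str, str]],
    call_llm: Callable[[str], str],
) -> Dict[str, int]:
    """Run one judged pass; the caller supplies the actual LLM client."""
    raw = call_llm(build_judge_prompt(conversation))
    scores = json.loads(raw)  # assumes the judge returned valid JSON
    return {d: int(scores[d]) for d in DIMENSIONS}
```

Keeping the LLM client behind a plain callable makes the same evaluation loop reusable across the four systems under comparison, since only the conversation transcripts change between runs.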