The recent success of large language models (LLMs) has shown great potential for developing more powerful conversational recommender systems (CRSs), which rely on natural language conversations to satisfy user needs. In this paper, we investigate the use of ChatGPT for conversational recommendation and find that the existing evaluation protocol is inadequate: it may over-emphasize matching the ground-truth items or utterances produced by human annotators while neglecting the interactive nature of a capable CRS. To overcome this limitation, we propose iEvaLM, an interactive Evaluation approach based on LLMs that harnesses LLM-based user simulators and can simulate various interaction scenarios between users and systems. In experiments on two publicly available CRS datasets, we demonstrate notable improvements compared to the prevailing evaluation protocol. We also emphasize the evaluation of explainability, where ChatGPT generates persuasive explanations for its recommendations. Our study contributes to a deeper understanding of the untapped potential of LLMs for CRSs and provides a more flexible and easy-to-use evaluation framework for future research. The code and data are publicly available at https://github.com/RUCAIBox/iEvaLM-CRS.