The recent success of large language models (LLMs) has shown great potential for developing more powerful conversational recommender systems (CRSs), which rely on natural language conversations to satisfy user needs. In this paper, we investigate the use of ChatGPT for conversational recommendation and reveal the inadequacy of the existing evaluation protocol: it may over-emphasize matching against ground-truth items or utterances produced by human annotators while neglecting the interactive nature of a capable CRS. To overcome this limitation, we propose iEvaLM, an interactive evaluation approach based on LLMs that harnesses LLM-based user simulators and can simulate various interaction scenarios between users and systems. In experiments on two publicly available CRS datasets, we demonstrate notable improvements compared to the prevailing evaluation protocol. Furthermore, we emphasize the evaluation of explainability, and ChatGPT generates persuasive explanations for its recommendations. Our study contributes to a deeper understanding of the untapped potential of LLMs for CRSs and provides a more flexible and easy-to-use evaluation framework for future research. The code and data are publicly available at https://github.com/RUCAIBox/iEvaLM-CRS.