With the rise of Large Language Models (LLMs) such as ChatGPT, researchers have been exploring how to utilize LLMs for better recommendations. However, although LLMs exhibit black-box and probabilistic characteristics (meaning their internal workings are not visible), the evaluation frameworks used to assess these LLM-based recommender systems (RS) are the same as those used for traditional recommender systems. To address this gap, we introduce metamorphic testing for the evaluation of GPT-based RS. This technique involves defining metamorphic relations (MRs) between inputs and checking whether the corresponding relationship holds in the outputs. Specifically, we examine MRs from both the RS and LLM perspectives, including rating multiplication/shifting on the RS side and adding spaces/randomness to the LLM prompt via prompt perturbation. Similarity metrics (e.g., Kendall's $\tau$ and Rank-Biased Overlap (RBO)) are used to measure whether the relationship is satisfied in the outputs of the MRs. Experimental results on the MovieLens dataset with GPT-3.5 show low similarity in terms of Kendall's $\tau$ and RBO, indicating the need for a comprehensive evaluation of LLM-based RS beyond the existing metrics used for traditional recommender systems.
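The similarity check at the heart of this evaluation can be sketched as follows: compare the source recommendation list against the follow-up list produced after a metamorphic perturbation. The movie names are placeholders, and the RBO function below is the simple truncated (non-extrapolated) variant, used here only as an illustrative assumption rather than the paper's exact implementation.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant when both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

def rbo(rank_a, rank_b, p=0.9):
    """Truncated Rank-Biased Overlap: top-weighted overlap of two ranked lists."""
    k = min(len(rank_a), len(rank_b))
    score = 0.0
    for d in range(1, k + 1):
        overlap = len(set(rank_a[:d]) & set(rank_b[:d]))
        score += (p ** (d - 1)) * overlap / d
    return (1 - p) * score

# Hypothetical lists: a source recommendation list and the follow-up list
# returned after a metamorphic perturbation of the prompt.
source = ["MovieA", "MovieB", "MovieC", "MovieD", "MovieE"]
followup = ["MovieB", "MovieA", "MovieC", "MovieE", "MovieD"]

print(kendall_tau(source, followup))  # → 0.6 (1.0 would mean identical order)
print(rbo(source, followup))
```

A metamorphic relation such as rating multiplication is violated to the degree that these scores fall below the values obtained for an unperturbed repeat run; low scores signal that the perturbation changed the ranking.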