Entity matching is a crucial component of many recommender systems, including conversational recommender systems (CRS) and knowledge-based recommender systems. However, the lack of rigorous evaluation frameworks for cross-dataset entity matching impedes progress in areas such as conversational recommendation driven by large language models (LLMs) and knowledge-grounded dataset construction. In this paper, we introduce Reddit-Amazon-EM, a novel dataset comprising naturally occurring items from Reddit and the Amazon'23 dataset. Through careful manual annotation, we identify corresponding movies across Reddit-Movies and Amazon'23, two existing recommender system datasets with inherently overlapping catalogs. Leveraging Reddit-Amazon-EM, we conduct a comprehensive evaluation of state-of-the-art entity matching methods, including rule-based, graph-based, lexical, embedding-based, and LLM-based approaches. To support reproducible research, we release our manually annotated entity matching gold set and provide the mapping between the two datasets produced by the best-performing method in our experiments. This serves as a valuable resource for advancing future work on entity matching in recommender systems. Data and code are available at: https://github.com/huang-zihan/Reddit-Amazon-Entity-Matching.
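To make the lexical family of baselines concrete, the following is a minimal sketch of string-similarity matching between a movie mention and a catalog of product titles. It is an illustration only, not the method evaluated in the paper: the function names `normalize` and `best_match`, the 0.8 threshold, and the example titles are all assumptions introduced here, and the similarity measure is Python's standard-library `difflib.SequenceMatcher`.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase, drop bracketed tags and punctuation, collapse whitespace.

    Assumption: bracketed text such as "(DVD)" or "[Blu-ray]" is noise.
    Note that this also drops a bracketed year such as "(1999)".
    """
    title = title.lower()
    title = re.sub(r"[\(\[][^\)\]]*[\)\]]", " ", title)
    title = re.sub(r"[^a-z0-9 ]+", " ", title)
    return " ".join(title.split())


def best_match(mention, catalog, threshold=0.8):
    """Return (catalog_title, score) for the most similar title, or None.

    Hypothetical helper: the threshold is an illustrative choice,
    not a value taken from the paper.
    """
    if not catalog:
        return None
    query = normalize(mention)
    scored = [
        (SequenceMatcher(None, query, normalize(candidate)).ratio(), candidate)
        for candidate in catalog
    ]
    score, title = max(scored)
    return (title, score) if score >= threshold else None


if __name__ == "__main__":
    # Made-up catalog entries in the style of product listings.
    catalog = ["The Matrix (DVD)", "Inception [Blu-ray]", "Toy Story"]
    print(best_match("the matrix", catalog))
    print(best_match("Some Unrelated Film", catalog))
```

A purely lexical matcher like this cannot separate remakes or sequels that share a title, which is one reason a manually annotated gold set is useful for comparing it against embedding-based and LLM-based approaches.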