Entity matching is a key component of many recommender systems, including conversational recommender systems (CRS) and knowledge-based recommender systems. However, the lack of rigorous evaluation frameworks for cross-dataset entity matching impedes progress in areas such as LLM-driven conversational recommendation and knowledge-grounded dataset construction. In this paper, we introduce Reddit-Amazon-EM, a new dataset of naturally occurring items from Reddit and the Amazon '23 dataset. Through careful manual annotation, we identify corresponding movies across Reddit-Movies and Amazon '23, two existing recommender system datasets whose catalogs inherently overlap. Using Reddit-Amazon-EM, we conduct a comprehensive evaluation of state-of-the-art entity matching methods, including rule-based, graph-based, lexical, embedding-based, and LLM-based approaches. To support reproducible research, we release our manually annotated entity matching gold set and provide a mapping between the two datasets produced by the best-performing method in our experiments. These resources are intended to support future work on entity matching in recommender systems.
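To make the task concrete, the following is a minimal sketch of what a lexical baseline and its evaluation against a gold set might look like. It is not the method or the evaluation protocol used in the paper: the normalization rules, the `0.85` threshold, the function names, and the toy catalog IDs are all assumptions chosen for illustration, and the similarity measure is Python's standard-library `difflib.SequenceMatcher` rather than any method named in the abstract.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase, drop a trailing year such as '(1999)', and strip punctuation."""
    title = re.sub(r"\(\d{4}\)\s*$", "", title.lower())
    title = re.sub(r"[^a-z0-9 ]+", " ", title)
    return " ".join(title.split())


def best_match(mention: str, catalog: dict, threshold: float = 0.85):
    """Return the catalog ID with the most similar normalized title, or None.

    `threshold` is an illustrative cutoff, not a value taken from the paper.
    """
    query = normalize(mention)
    best_id, best_score = None, 0.0
    for item_id, title in catalog.items():
        score = SequenceMatcher(None, query, normalize(title)).ratio()
        if score > best_score:
            best_id, best_score = item_id, score
    return best_id if best_score >= threshold else None


def precision_recall(predicted: dict, gold: dict):
    """Score predictions (mention -> ID or None) against a gold mapping."""
    made = {m: i for m, i in predicted.items() if i is not None}
    correct = sum(1 for m, i in made.items() if gold.get(m) == i)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: the IDs and titles are made up for this sketch.
catalog = {"B001": "The Matrix (1999)", "B002": "Inception"}
gold = {"the matrix": "B001", "Inception!": "B002"}
mentions = ["the matrix", "Inception!", "Totally Unrelated Film"]

predicted = {m: best_match(m, catalog) for m in mentions}
precision, recall = precision_recall(predicted, gold)
```

A character-level similarity of this kind is cheap, but it cannot separate remakes or sequels with near-identical titles, which is one reason the embedding-based and LLM-based approaches are worth comparing on the same gold set.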