On the Factual Consistency of Text-based Explainable Recommendation Models

Text-based explainable recommendation aims to generate natural-language explanations that justify item recommendations, with the goal of improving user trust and system transparency. Although recent advances leverage LLMs to produce fluent outputs, a critical question remains underexplored: are these explanations factually consistent with the available evidence? We introduce a comprehensive framework for evaluating the factual consistency of text-based explainable recommenders. We design a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews, thereby constructing a ground truth that isolates the reviews' factual content. Applying this pipeline to five categories from the Amazon Reviews dataset, we create augmented benchmarks for fine-grained evaluation of explanation quality. We further propose statement-level alignment metrics that combine LLM- and NLI-based approaches to assess both the factual consistency and the relevance of generated explanations. In extensive experiments on six state-of-the-art explainable recommendation models, we uncover a critical gap: while models achieve high semantic similarity scores (BERTScore F1: 0.81-0.90), all of our factuality metrics reveal alarmingly low performance (LLM-based statement-level precision: 4.38%-32.88%). These findings underscore the need for factuality-aware evaluation in explainable recommendation and provide a foundation for developing more trustworthy explanation systems.
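To make the idea of a statement-level alignment metric concrete, the following is a minimal sketch and not the authors' implementation. It assumes that precision is the share of generated statements supported by at least one ground-truth statement, and that recall is the share of ground-truth statements covered by at least one generated statement; the abstract does not give the exact definitions. The names `statement_precision_recall` and `token_overlap_judge` are hypothetical, and the token-overlap judge is only a toy stand-in for the LLM- or NLI-based judgment the abstract describes.

```python
from typing import Callable, List, Tuple

# A judge decides whether `hypothesis` is supported by `premise`.
# In practice this would be an LLM prompt or an NLI model (assumption).
Judge = Callable[[str, str], bool]


def token_overlap_judge(premise: str, hypothesis: str, threshold: float = 0.6) -> bool:
    """Toy stand-in judge: share of hypothesis tokens that also occur in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = set(hypothesis.lower().split())
    if not hypothesis_tokens:
        return False
    overlap = len(hypothesis_tokens & premise_tokens) / len(hypothesis_tokens)
    return overlap >= threshold


def statement_precision_recall(
    generated: List[str],
    reference: List[str],
    judge: Judge = token_overlap_judge,
) -> Tuple[float, float]:
    """Return (precision, recall) over atomic statements (assumed definitions)."""
    # Precision: generated statements supported by some reference statement.
    supported = sum(any(judge(ref, gen) for ref in reference) for gen in generated)
    precision = supported / len(generated) if generated else 0.0
    # Recall: reference statements covered by some generated statement.
    covered = sum(any(judge(gen, ref) for gen in generated) for ref in reference)
    recall = covered / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical atomic statements, invented purely for illustration.
reference = ["the battery lasts all day", "the screen is bright"]
generated = ["the battery lasts all day", "the camera is excellent"]
precision, recall = statement_precision_recall(generated, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Keeping the judge as a pluggable callable means the same scoring loop can be reused with either an LLM-based or an NLI-based judge. A fluent explanation can score well on a similarity measure and still receive low precision here when its individual statements are not supported by the reference.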
