Textual explanations generated with large language models (LLMs) are increasingly used to justify recommendations, yet evaluating these explanations remains a critical challenge. We advocate a shift in objective: rank, don't generate. We formalize explainable recommendation as a statement-level ranking problem, in which a system ranks candidate explanatory statements derived from reviews and returns the top-k as the explanation. This formulation mitigates hallucination by construction and enables fine-grained factual analysis. It also models factor importance through relevance scores and supports standardized, reproducible evaluation with established ranking metrics. Meaningful assessment, however, requires each statement to be explanatory (an item fact that affects user experience), atomic (one opinion about one aspect), and unique (paraphrases consolidated), properties that are difficult to obtain from noisy reviews. We address this with (i) an LLM-based extraction pipeline that produces explanatory and atomic statements, and (ii) a scalable semantic clustering method that consolidates paraphrases to enforce uniqueness. Building on this pipeline, we introduce StaR, a benchmark for statement ranking in explainable recommendation, constructed from four Amazon Reviews 2014 product categories. We evaluate popularity-based baselines and state-of-the-art models under global-level ranking (over all statements) and item-level ranking (over the target item's statements). Popularity baselines are competitive in global-level ranking and, on average, outperform state-of-the-art models in item-level ranking, exposing critical limitations in personalized explanation ranking.
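To make the two evaluation settings concrete, the following is a minimal sketch of a popularity baseline ranked at the global level and at the item level, scored with binary-relevance NDCG@k. It is an illustration only, not the StaR implementation: the toy interactions, the statement ids, the helper names (`fit_popularity`, `rank`, `ndcg_at_k`), the choice of per-item counts for the item-level baseline, and the choice of NDCG as the metric are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical toy data: (user, item, ids of statements found in the user's review).
train = [
    ("u1", "i1", ["s1", "s2"]),
    ("u2", "i1", ["s1"]),
    ("u3", "i2", ["s3", "s1"]),
    ("u4", "i2", ["s3", "s4"]),
]


def fit_popularity(interactions):
    """Count statement occurrences over all interactions and per item."""
    global_pop = Counter()
    item_pop = {}
    for _user, item, statements in interactions:
        global_pop.update(statements)
        item_pop.setdefault(item, Counter()).update(statements)
    return global_pop, item_pop


def rank(scores, candidates, k):
    """Return the top-k candidates by score; ties are broken by statement id."""
    return sorted(candidates, key=lambda s: (-scores.get(s, 0), s))[:k]


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k of a ranked list against a set of relevant ids."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, s in enumerate(ranked[:k])
        if s in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


global_pop, item_pop = fit_popularity(train)

# Hypothetical held-out interaction: a new user on item i2 with two true statements.
target_item, relevant = "i2", {"s3", "s4"}

# Global-level: candidates are all statements in the corpus.
global_ranking = rank(global_pop, set(global_pop), k=2)

# Item-level: candidates are only the statements attached to the target item
# (scored here by per-item counts, which is an assumption of this sketch).
item_ranking = rank(item_pop[target_item], set(item_pop[target_item]), k=2)

global_ndcg = ndcg_at_k(global_ranking, relevant, k=2)
item_ndcg = ndcg_at_k(item_ranking, relevant, k=2)
print(global_ranking, round(global_ndcg, 3))
print(item_ranking, round(item_ndcg, 3))
```

Because every candidate is an existing statement id, the returned top-k cannot contain text that never appeared in a review, which is the sense in which a ranking formulation avoids hallucination by construction. The only difference between the two settings in this sketch is the candidate set passed to `rank`.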