From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems

Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. In this paper, we systematically reproduce, re-implement, and re-evaluate eleven state-of-the-art CE methods for recommender systems, covering both native explainers (e.g., LIME-RS, SHAP, PRINCE, ACCENT, LXR, GREASE) and specific graph-based explainers originally proposed for GNNs. We propose a unified benchmarking framework that assesses explainers along three dimensions: explanation format (implicit vs. explicit), evaluation level (item-level vs. list-level), and perturbation scope (user interaction vectors vs. user-item interaction graphs). Our evaluation protocol includes effectiveness, sparsity, and computational complexity metrics, and extends existing item-level assessments to top-K list-level explanations. Through extensive experiments on three real-world datasets and six representative recommender models, we analyze how well previously reported strengths of CE methods generalize across diverse setups. We observe that the trade-off between effectiveness and sparsity depends strongly on the specific method and evaluation setting, particularly under the explicit format. In addition, explainer performance remains largely consistent across item-level and list-level evaluations, and several graph-based explainers exhibit notable scalability limitations on large recommender graphs. Our results refine and challenge earlier conclusions about the robustness and practicality of CE generation methods in recommender systems. Repository: https://github.com/L2R-UET/CFExpRec.
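To make the item-level versus list-level distinction concrete, the following is a minimal Python sketch and not the benchmark's implementation. It rests on several assumptions that the abstract does not state: the recommender is a toy item-similarity scorer, a counterfactual is a set of interactions removed from the user's history, item-level validity means the original top-1 item is displaced, list-level validity means the top-K set changes, and sparsity is measured as the number of removed interactions. The names `SIM`, `recommend`, and `evaluate_counterfactual` are hypothetical.

```python
import numpy as np

# Hypothetical symmetric item-item similarity matrix for 5 items (toy data).
SIM = np.array([
    [0.0, 0.1, 0.9, 0.2, 0.1],
    [0.1, 0.0, 0.2, 0.8, 0.3],
    [0.9, 0.2, 0.0, 0.1, 0.4],
    [0.2, 0.8, 0.1, 0.0, 0.5],
    [0.1, 0.3, 0.4, 0.5, 0.0],
])


def recommend(history, sim, k, exclude):
    """Toy recommender: score each item by its summed similarity to the
    items in `history`, then return the top-k items not in `exclude`."""
    scores = sim[sorted(history)].sum(axis=0).astype(float)
    scores[list(exclude)] = -np.inf  # never recommend excluded items
    ranked = np.argsort(-scores, kind="stable")
    return [int(i) for i in ranked[:k]]


def evaluate_counterfactual(history, removed, sim, k):
    """Check a candidate counterfactual (a set of removed interactions)
    under the assumed item-level and list-level validity definitions."""
    # Exclude the original history in both runs so that removed items
    # cannot re-enter the list as recommendations.
    original = recommend(history, sim, k, exclude=history)
    perturbed = recommend(history - removed, sim, k, exclude=history)
    return {
        "original": original,
        "perturbed": perturbed,
        # Assumed item-level criterion: the top-1 item is displaced.
        "item_level_valid": original[0] != perturbed[0],
        # Assumed list-level criterion: the top-k set changes.
        "list_level_valid": set(original) != set(perturbed),
        # Assumed sparsity measure: number of removed interactions.
        "size": len(removed),
    }


if __name__ == "__main__":
    history = {0, 1}
    for removed in ({0}, {1}):
        print(removed, evaluate_counterfactual(history, removed, SIM, k=2))
```

In this toy setup, removing item 0 changes both the top-1 item and the top-2 set, whereas removing item 1 changes neither, so only the first removal counts as a counterfactual under either criterion. A real explainer would search for such a set rather than test hand-picked candidates.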
