Large Language Models (LLMs) are increasingly used to generate natural-language explanations in recommender systems, acting as explanation agents that reason over user behavior histories. While prior work has focused on explanation fluency and relevance under fixed inputs, the robustness of LLM-generated explanations to realistic user behavior noise remains largely unexplored. In real-world web platforms, interaction histories are inherently noisy due to accidental clicks, temporal inconsistencies, missing values, and evolving preferences, raising concerns about explanation stability and user trust. We present RobustExplain, the first systematic evaluation framework for measuring the robustness of LLM-generated recommendation explanations. RobustExplain introduces five realistic user behavior perturbations evaluated across multiple severity levels and a multi-dimensional robustness metric capturing semantic, keyword, structural, and length consistency. Our goal is to establish a principled, task-level evaluation framework and initial robustness baselines, rather than to provide a comprehensive leaderboard across all available LLMs. Experiments on four representative LLMs (7B--70B) show that current models exhibit only moderate robustness, with larger models achieving up to 8% higher stability. Our results establish the first robustness benchmarks for explanation agents and highlight robustness as a critical dimension for trustworthy, agent-driven recommender systems at web scale.
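As a rough illustration of the kind of multi-dimensional robustness metric the abstract describes, the sketch below combines semantic, keyword, structural, and length consistency between an original explanation and its perturbed counterpart into a single score. Every function name, the token-overlap proxy for semantic similarity, and the unweighted averaging are illustrative assumptions, not RobustExplain's actual implementation.

```python
# Hypothetical sketch of a multi-dimensional explanation-robustness score.
# The token-overlap proxy for semantic similarity, the structural proxy, and
# the unweighted mean are assumptions for illustration only.
import re


def _tokens(text: str) -> set[str]:
    """Lowercased word tokens, used by the overlap-based proxies below."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def semantic_consistency(original: str, perturbed: str) -> float:
    # Proxy: Jaccard overlap of tokens; a real system would likely use
    # sentence embeddings and cosine similarity instead.
    a, b = _tokens(original), _tokens(perturbed)
    return len(a & b) / len(a | b) if a | b else 1.0


def keyword_consistency(original: str, perturbed: str, keywords: set[str]) -> float:
    # Fraction of recommendation-relevant keywords that survive the perturbation.
    base = {k for k in keywords if k in _tokens(original)}
    kept = {k for k in base if k in _tokens(perturbed)}
    return len(kept) / len(base) if base else 1.0


def structural_consistency(original: str, perturbed: str) -> float:
    # Proxy: ratio of sentence counts; closer to 1.0 means similar structure.
    n_o = max(1, original.count("."))
    n_p = max(1, perturbed.count("."))
    return min(n_o, n_p) / max(n_o, n_p)


def length_consistency(original: str, perturbed: str) -> float:
    lo, lp = len(original.split()), len(perturbed.split())
    return min(lo, lp) / max(lo, lp) if max(lo, lp) else 1.0


def robustness_score(original: str, perturbed: str, keywords: set[str]) -> float:
    """Unweighted mean over the four consistency dimensions (illustrative)."""
    dims = [
        semantic_consistency(original, perturbed),
        keyword_consistency(original, perturbed, keywords),
        structural_consistency(original, perturbed),
        length_consistency(original, perturbed),
    ]
    return sum(dims) / len(dims)
```

In use, each LLM-generated explanation for a clean history would be compared against the explanations produced under each perturbation type and severity level, and the per-dimension scores aggregated across the test set to obtain the reported robustness numbers.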