Large Language Models (LLMs) are increasingly used to generate natural-language explanations in recommender systems, acting as explanation agents that reason over user behavior histories. While prior work has focused on explanation fluency and relevance under fixed inputs, the robustness of LLM-generated explanations to realistic user behavior noise remains largely unexplored. In real-world web platforms, interaction histories are inherently noisy due to accidental clicks, temporal inconsistencies, missing values, and evolving preferences, raising concerns about explanation stability and user trust. We present RobustExplain, the first systematic evaluation framework for measuring the robustness of LLM-generated recommendation explanations. RobustExplain introduces five realistic user behavior perturbations evaluated across multiple severity levels and a multi-dimensional robustness metric capturing semantic, keyword, structural, and length consistency. Our goal is to establish a principled, task-level evaluation framework and initial robustness baselines, rather than to provide a comprehensive leaderboard across all available LLMs. Experiments on four representative LLMs (7B--70B) show that current models exhibit only moderate robustness, with larger models achieving up to 8% higher stability. Our results establish the first robustness benchmarks for explanation agents and highlight robustness as a critical dimension for trustworthy, agent-driven recommender systems at web scale.
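The multi-dimensional robustness metric described above can be sketched as a weighted combination of per-dimension consistency scores between an explanation generated from clean input and one generated from perturbed input. The sketch below is a minimal illustrative proxy, not the paper's actual implementation: the weights, the Jaccard word overlap used for keyword consistency, the sentence-count ratio used for structural consistency, and the character-level sequence similarity standing in for semantic consistency are all assumptions.

```python
from difflib import SequenceMatcher

def robustness_score(original: str, perturbed: str,
                     weights=(0.4, 0.3, 0.15, 0.15)) -> float:
    """Hypothetical multi-dimensional consistency score in [0, 1].

    Combines proxies for the four dimensions named in the abstract:
    semantic, keyword, structural, and length consistency.
    Weights are illustrative assumptions, not from the paper.
    """
    # Semantic proxy: character-level sequence similarity.
    # (A real implementation would likely use embedding cosine similarity.)
    semantic = SequenceMatcher(None, original, perturbed).ratio()

    # Keyword consistency: Jaccard overlap of lowercased word sets.
    a, b = set(original.lower().split()), set(perturbed.lower().split())
    keyword = len(a & b) / len(a | b) if (a | b) else 1.0

    # Structural consistency: ratio of sentence counts (period-delimited).
    s1, s2 = max(original.count("."), 1), max(perturbed.count("."), 1)
    structural = min(s1, s2) / max(s1, s2)

    # Length consistency: ratio of word counts.
    l1, l2 = len(original.split()), len(perturbed.split())
    length = min(l1, l2) / max(l1, l2) if max(l1, l2) else 1.0

    w_sem, w_kw, w_str, w_len = weights
    return w_sem * semantic + w_kw * keyword + w_str * structural + w_len * length
```

Identical explanations score 1.0, and any divergence in wording, structure, or length lowers the score, so averaging this quantity over many perturbed histories gives a single stability number per model.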