Explainable AI (XAI) methods such as SHAP and LIME produce numerical feature attributions that remain inaccessible to non-expert users. Prior work has shown that Large Language Models (LLMs) can transform these outputs into natural language explanations (NLEs), but it remains unclear which factors contribute to high-quality explanations. We present a systematic factorial study investigating how forecasting model choice, XAI method, LLM selection, and prompting strategy affect NLE quality. Our design spans four forecasting models (XGBoost (XGB), Random Forest (RF), Multilayer Perceptron (MLP), and SARIMAX, contrasting black-box Machine-Learning (ML) models with a classical time-series approach), three XAI conditions (SHAP, LIME, and a no-XAI baseline), three LLMs (GPT-4o, Llama-3-8B, and DeepSeek-R1), and eight prompting strategies. Using G-Eval, an LLM-as-a-judge evaluation method, with dual LLM judges and four evaluation criteria, we assess 660 explanations for time-series forecasting. Our results suggest that: (1) XAI provides only small improvements over no-XAI baselines, and only for expert audiences; (2) LLM choice dominates all other factors, with DeepSeek-R1 outperforming GPT-4o and Llama-3; (3) we observe an interpretability paradox: in our setting, SARIMAX yielded lower NLE quality than the ML models despite higher prediction accuracy; (4) zero-shot prompting is competitive with self-consistency at one-seventh of the cost; and (5) chain-of-thought prompting hurts rather than helps.