Large Language Models (LLMs) can produce verbalized self-explanations, yet prior studies suggest that such rationales may not reliably reflect the model's true decision process. We ask whether these explanations nevertheless help users predict model behavior, operationalized as counterfactual simulatability. Using StrategyQA, we evaluate how well humans and LLM judges can predict a model's answers to counterfactual follow-up questions, with and without access to the model's chain-of-thought or post-hoc explanations. We compare LLM-generated counterfactuals with pragmatics-based perturbations as two strategies for constructing the test cases used to assess how useful explanations are. Our results show that self-explanations consistently improve simulation accuracy for both LLM judges and humans, but the size and stability of the gains depend strongly on the perturbation strategy and on the strength of the judge. We also conduct a qualitative analysis of the free-text justifications that human participants wrote when predicting the model's behavior, which provides evidence that access to explanations helps them form more accurate predictions on the perturbed questions.