Large language models (LLMs) are trained to imitate humans in explaining human decisions. However, do LLMs explain themselves? Can they help humans build mental models of how LLMs process different inputs? To answer these questions, we propose to evaluate the $\textbf{counterfactual simulatability}$ of natural language explanations: whether an explanation enables humans to precisely infer the model's outputs on diverse counterfactuals of the explained input. For example, if a model answers "yes" to the input question "Can eagles fly?" with the explanation "all birds can fly", then humans would infer from the explanation that it would also answer "yes" to the counterfactual input "Can penguins fly?". If the explanation is precise, then the model's answer should match humans' expectations. We implemented two metrics based on counterfactual simulatability: precision and generality. We generated diverse counterfactuals automatically using LLMs. We then used these metrics to evaluate state-of-the-art LLMs (e.g., GPT-4) on two tasks: multi-hop factual reasoning and reward modeling. We found that LLMs' explanations have low precision and that precision does not correlate with plausibility. Therefore, naively optimizing for human approval (e.g., RLHF) may not be a sufficient solution.
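To make the notion of precision concrete, the following is a minimal sketch, not the authors' implementation. It assumes that precision is the fraction of counterfactuals on which the human-inferred answer matches the model's actual answer, counted only over counterfactuals where the explanation lets a human infer an answer at all. The function name `simulation_precision`, the use of `None` for "cannot be inferred from the explanation", and the example answers are all hypothetical.

```python
from typing import Optional, Sequence


def simulation_precision(
    human_guesses: Sequence[Optional[str]],
    model_outputs: Sequence[str],
) -> Optional[float]:
    """Fraction of simulatable counterfactuals where the human guess matches the model.

    human_guesses[i] is the answer a human infers from the explanation for
    counterfactual i, or None if the explanation does not let them infer one
    (an assumed convention for this sketch).
    model_outputs[i] is the model's actual answer on counterfactual i.
    Returns None when no counterfactual is simulatable.
    """
    if len(human_guesses) != len(model_outputs):
        raise ValueError("human_guesses and model_outputs must have the same length")

    # Keep only counterfactuals for which a human could infer an answer.
    pairs = [(g, o) for g, o in zip(human_guesses, model_outputs) if g is not None]
    if not pairs:
        return None

    matches = sum(1 for g, o in pairs if g == o)
    return matches / len(pairs)


# Explanation: "all birds can fly" (given for "Can eagles fly?" -> "yes").
# Hypothetical counterfactuals: penguins, sparrows, cats.
guesses = ["yes", "yes", None]   # the explanation says nothing about cats
outputs = ["no", "yes", "no"]    # hypothetical model answers
print(simulation_precision(guesses, outputs))
```

In this toy case the explanation leads a human to expect "yes" for penguins while the model answers "no", so only one of the two simulatable counterfactuals matches.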