The self-rationalising capabilities of LLMs are appealing because the generated explanations can give insights into the plausibility of the predictions. However, it is questionable how faithful these explanations are to the predictions, which raises the need to explore the patterns behind them further. To this end, we propose a hypothesis-driven statistical framework. We use a Bayesian network to implement a hypothesis about how a task (in our example, natural language inference) is solved, and we translate the network's internal states into natural language with templates. These template-based explanations are then compared to LLM-generated free-text explanations using automatic and human evaluations. This allows us to judge how similar the LLM's and the Bayesian network's decision processes are. We demonstrate the usage of our framework with an example hypothesis and two realisations in Bayesian networks. The resulting models do not exhibit a strong similarity to GPT-3.5. We discuss the implications of this as well as the framework's potential to approximate LLM decisions better in future work.
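To make the pipeline concrete, the following is a minimal sketch of the idea, not the networks or metrics used in this work. It assumes a hypothetical two-variable network for natural language inference (`overlap` and `negation` are invented features), made-up probabilities, invented templates, and a simple Jaccard token overlap as a stand-in for an automatic similarity measure.

```python
from itertools import product

# Hypothetical toy network (an assumption, not the paper's model):
# two binary parent variables influence the NLI label.
P_OVERLAP = {True: 0.6, False: 0.4}    # do the sentences share entities?
P_NEGATION = {True: 0.3, False: 0.7}   # is exactly one sentence negated?

# P(label | overlap, negation); illustrative numbers only.
P_LABEL = {
    (True, False): {"entailment": 0.80, "contradiction": 0.05, "neutral": 0.15},
    (True, True): {"entailment": 0.05, "contradiction": 0.85, "neutral": 0.10},
    (False, False): {"entailment": 0.10, "contradiction": 0.10, "neutral": 0.80},
    (False, True): {"entailment": 0.05, "contradiction": 0.25, "neutral": 0.70},
}

# Hypothetical templates that verbalise the states of the parent variables.
TEMPLATES = {
    "overlap": {True: "the two sentences talk about the same entities",
                False: "the two sentences talk about different entities"},
    "negation": {True: "one of them is negated",
                 False: "neither of them is negated"},
}


def label_posterior(evidence):
    """Compute P(label | evidence) by enumerating all parent states."""
    scores = {"entailment": 0.0, "contradiction": 0.0, "neutral": 0.0}
    for overlap, negation in product([True, False], repeat=2):
        # Skip parent states that contradict the observed evidence.
        if evidence.get("overlap", overlap) != overlap:
            continue
        if evidence.get("negation", negation) != negation:
            continue
        weight = P_OVERLAP[overlap] * P_NEGATION[negation]
        for label, p in P_LABEL[(overlap, negation)].items():
            scores[label] += weight * p
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


def explain(evidence):
    """Return the most probable label and a template-based explanation."""
    posterior = label_posterior(evidence)
    label = max(posterior, key=posterior.get)
    reasons = [TEMPLATES[name][evidence[name]]
               for name in ("overlap", "negation") if name in evidence]
    return label, f"The label is {label} because " + " and ".join(reasons) + "."


def jaccard(text_a, text_b):
    """Token-level Jaccard similarity, a crude stand-in for automatic evaluation."""
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


evidence = {"overlap": True, "negation": True}
label, template_explanation = explain(evidence)
# Invented example of what a free-text LLM explanation might look like.
llm_explanation = "Both sentences describe the same entities but one of them is negated."
print(label, template_explanation, jaccard(template_explanation, llm_explanation))
```

A low similarity score in such a comparison would suggest that the hypothesis encoded in the network does not describe how the LLM explains its decision; human evaluation would still be needed to confirm what a surface-level metric can only hint at.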