Understanding the extent to which Chain-of-Thought (CoT) generations align with the internal computations of a large language model (LLM) is critical for deciding whether to trust the model's output. As a proxy for CoT faithfulness, Lanham et al. (2023) propose a metric that measures a model's dependence on its CoT for producing an answer. Within a single family of proprietary models, they find that LLMs exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13-billion-parameter model exhibits increased faithfulness compared to models ranging from 810 million to 175 billion parameters in size. We evaluate whether these results generalize as a property of all LLMs. We replicate the experimental setup of their scaling experiments with three different families of models and, under specific conditions, successfully reproduce the scaling trends for CoT faithfulness they report. However, after normalizing the metric to account for a model's bias toward certain answer choices, unfaithfulness drops significantly for smaller, less capable models. This normalized faithfulness metric is also strongly correlated ($R^2 = 0.74$) with accuracy, raising doubts about its validity for evaluating faithfulness.