Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. While this phenomenon in humans is best explained by the dual-process theory of reasoning, the mechanisms behind content effects in LLMs remain unclear. In this work, we address this issue by investigating how LLMs encode the concepts of validity and plausibility within their internal representations. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgments, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Finally, we construct debiasing vectors that disentangle these concepts, reducing content effects and improving reasoning accuracy. Our findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.
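The steering and debiasing interventions summarized above can be illustrated with a minimal numpy sketch. Everything here is a toy construction, not the paper's actual method: the "validity" and "plausibility" directions, the hidden dimension, and the steering coefficient are all hypothetical, chosen only to show the geometry of adding a concept vector to a hidden state and of projecting out the shared component between two aligned directions.

```python
import numpy as np

# Toy setup: a small hidden dimension and two concept directions with
# high cosine similarity, mimicking the validity/plausibility alignment
# described in the abstract. All values are illustrative.
rng = np.random.default_rng(0)
d = 8  # hypothetical hidden dimension

v_validity = rng.normal(size=d)
v_validity /= np.linalg.norm(v_validity)

# Plausibility direction built as validity plus noise, so the two
# vectors are correlated but not identical.
v_plaus = v_validity + 0.3 * rng.normal(size=d)
v_plaus /= np.linalg.norm(v_plaus)

alignment = float(v_validity @ v_plaus)  # cosine similarity in [-1, 1]

def steer(h, direction, alpha):
    """Add a scaled concept vector to a hidden state (activation steering)."""
    return h + alpha * direction

# Debiasing sketch: isolate the part of the plausibility direction that
# is orthogonal to validity, then remove that component from a hidden
# state, leaving validity-relevant structure untouched.
residual = v_plaus - (v_plaus @ v_validity) * v_validity
residual /= np.linalg.norm(residual)

h = rng.normal(size=d)                       # a toy hidden state
h_biased = steer(h, v_plaus, alpha=2.0)      # inject plausibility signal
h_debiased = h_biased - (h_biased @ residual) * residual
```

After the projection step, `h_debiased` has no component along the plausibility-specific residual direction, which is the geometric intuition behind a debiasing vector that disentangles the two concepts.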