Chain-of-thought prompting has emerged as a promising technique for eliciting reasoning capabilities from Large Language Models (LLMs). However, it does not always improve task performance or accurately represent the underlying reasoning process, leaving open questions about when and how it should be used. In this paper, we diagnose the underlying mechanism by comparing the reasoning process of LLMs with that of humans, using causal analysis to understand the relationships among the problem instruction, the reasoning, and the answer in LLMs. Our empirical study reveals that LLMs often deviate from the ideal causal chain, resulting in spurious correlations and potential consistency errors (inconsistent reasoning and answers). We also examine factors influencing the causal structure, finding that in-context learning with examples strengthens it, while post-training techniques such as supervised fine-tuning and reinforcement learning from human feedback weaken it. To our surprise, the causal structure cannot be strengthened merely by enlarging the model size, which calls for research on new techniques. We hope that this preliminary study will shed light on understanding and improving the reasoning process in LLMs.