What does it truly mean for a language model to "reason"? Current evaluations reward correct standalone answers, but correctness alone reveals little about the process that produced them. We argue that reasoning should be understood not as a static chain of steps but as a dynamic trajectory in which ideas interact, clash, and evolve into integrated insights. Building on the philosophical tradition of dialectics, we introduce SIEV, a structured evaluation framework that assesses reasoning through explicit thesis-antithesis-synthesis interactions. SIEV produces interpretable trajectories that highlight key properties of reasoning: robustness to challenge, adaptability under conflict, and synthesis across competing viewpoints. These are dimensions that conventional correctness-based metrics cannot capture. Empirical results on GSM and MMLU reveal substantial gaps in the reasoning abilities of state-of-the-art models: for example, GPT-5-chat loses more than 40 points (out of 100) on GSM when evaluated through SIEV's process-oriented lens. By shifting focus from what answer a model gives to how it arrives there, SIEV enables a more transparent and principled distinction between structured reasoning and surface-level pattern generation, offering a clearer foundation for assessing and understanding the reasoning capabilities of LLMs.
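The thesis-antithesis-synthesis loop the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions, not SIEV's actual implementation: the `dialectical_eval` function, the prompt wording, and the use of plain callables to stand in for language models are all hypothetical placeholders chosen for the sketch.

```python
# Hypothetical sketch of a dialectical evaluation loop; not SIEV's real code.
# A "model" here is any callable that maps a prompt string to a response string.
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """Records each stage of the thesis-antithesis-synthesis interaction."""
    steps: list = field(default_factory=list)

def dialectical_eval(model, question, challenger, rounds=1):
    """Probe reasoning as a process: challenge the model's answer and ask it
    to synthesize a revision, instead of grading the first answer alone."""
    traj = Trajectory()
    thesis = model(f"Question: {question}\nGive your answer with reasoning.")
    traj.steps.append(("thesis", thesis))
    for _ in range(rounds):
        # The challenger raises the strongest objection it can (antithesis).
        antithesis = challenger(
            f"Question: {question}\nProposed answer: {thesis}\n"
            "Raise the strongest objection to this answer."
        )
        traj.steps.append(("antithesis", antithesis))
        # The model must reconcile the objection (synthesis).
        thesis = model(
            f"Question: {question}\nYour answer: {thesis}\n"
            f"Objection: {antithesis}\n"
            "Reconcile the objection and give a revised answer."
        )
        traj.steps.append(("synthesis", thesis))
    return traj

# Toy usage with stub callables standing in for real LLMs.
model = lambda prompt: "42, because 6*7=42."
challenger = lambda prompt: "Are you sure the product is 42 and not 36?"
t = dialectical_eval(model, "What is 6*7?", challenger)
print([stage for stage, _ in t.steps])  # ['thesis', 'antithesis', 'synthesis']
```

The resulting `Trajectory` is the interpretable artifact: a grader can score how the answer held up under challenge, not merely whether the final answer is correct.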