Large language models (LLMs) excel at many natural language tasks, yet their reasoning reliability under structured perturbations of rule-based systems remains brittle. We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs. essential); (2) contradictory evidence injection; (3) logic-preserving rewrites; and (4) multi-law equivalence stacking. While representative model families (BERT, Qwen2, and TinyLlama) achieve Acc = 1.0000 on base tasks, our framework reveals a critical failure mode we term Logic Inertia: a total breakdown (Acc = 0.0000) under contradictions, in which deductive momentum overrides factual evidence. To resolve this, we propose Conflict-Aware Fusion, a framework grounded in the Cognitive Structure Hypothesis, which posits that robust reasoning requires an explicit structural inductive bias. By imposing a dual-process architecture that separates premise verification from logical deduction, Conflict-Aware Fusion eliminates Logic Inertia, achieving 1.0000 accuracy on both the base and contradiction stress tests and significantly enhancing robustness to missing evidence. Our results demonstrate that, for reliable multi-step reasoning, structural verification discipline is as critical as training data scale, providing a blueprint for building robust, contradiction-aware AI systems. Code: https://github.com/14H034160212/lemo. See also the OpenAI/Evals pull request: https://github.com/openai/evals/pull/1622.
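To make the dual-process idea concrete, below is a minimal Python sketch of premise-verified forward chaining under a contradictory-evidence injection. The names (`Rule`, `verify_premises`, `deduce`) and the toy propositional setting are illustrative assumptions, not the paper's actual implementation; the point is only that gating deduction behind explicit premise verification blocks the Logic Inertia failure mode.

```python
# Sketch of the dual-process separation behind Conflict-Aware Fusion:
# Stage 1 verifies premises against evidence; Stage 2 deduces only from
# verified premises. All names here are hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    premises: tuple[str, ...]  # facts that must hold for the rule to fire
    conclusion: str

def verify_premises(rule: Rule, facts: set[str], negated: set[str]) -> bool:
    """Stage 1: a rule is usable only if every premise is asserted
    and none is explicitly contradicted by the evidence."""
    return all(p in facts and p not in negated for p in rule.premises)

def deduce(rules: list[Rule], facts: set[str], negated: set[str]) -> set[str]:
    """Stage 2: forward-chaining deduction, gated by Stage 1.
    Conclusions that conflict with the evidence are rejected, so the
    deductive chain cannot steamroll an injected contradiction."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if verify_premises(rule, derived, negated):
                c = rule.conclusion
                if c not in derived and c not in negated:
                    derived.add(c)
                    changed = True
    return derived

# Contradictory-evidence stress test: the injected negation of
# "mortal(socrates)" must override the rule chain, not be overridden by it.
rules = [Rule(("human(socrates)",), "mortal(socrates)")]
facts = {"human(socrates)"}
negated = {"mortal(socrates)"}  # injected contradiction
assert "mortal(socrates)" not in deduce(rules, facts, negated)
```

In this toy setting, a purely deductive system would assert `mortal(socrates)` despite the contradicting evidence (the Logic Inertia pattern); the verification gate is what makes the contradiction win.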