Large language models (LLMs) excel at natural language tasks but remain brittle in domains requiring precise logical and symbolic reasoning. Chaotic dynamical systems provide an especially demanding test because chaos is deterministic yet often misinterpreted as randomness or complexity. We introduce ChaosBench-Logic, a benchmark that evaluates LLM reasoning across 30 diverse dynamical systems using a unified first-order logic (FOL) ontology. Each system is annotated with truth assignments for 11 semantic predicates, and 621 questions are generated across seven reasoning categories, including multi-hop implications, cross-system analogies, counterfactual reasoning, bias probes, and multi-turn dialogues. We define metrics for logical accuracy, implication consistency, dialogue coherence, and contradiction, and we release an open-source evaluation pipeline. Initial experiments show that frontier LLMs such as GPT-4, Claude 3.5 Sonnet, Gemini 2.5 Flash, and the open-source LLaMA-3 70B achieve 91-94% per-item accuracy, yet still score 0% on compositional items and exhibit fragile global coherence. Dialogue-level accuracy ranges from 53.1% (GPT-4 CoT) to 75.5% (LLaMA-3 zero-shot). ChaosBench-Logic provides a rigorous testbed for diagnosing such failures and a foundation for developing neuro-symbolic approaches that improve scientific reasoning in LLMs.
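To make the implication-consistency metric concrete, the following is a minimal illustrative sketch of how per-system predicate answers could be checked against FOL implication rules. The predicate names, the rule set, and the scoring function here are hypothetical stand-ins chosen for exposition; they are not the released ChaosBench-Logic pipeline.

```python
# Illustrative sketch only: predicate names, rules, and scoring are hypothetical,
# not the released ChaosBench-Logic evaluation code.

from typing import Dict, List, Tuple

# Hypothetical implication rules between semantic predicates (antecedent -> consequent),
# e.g. "chaotic implies sensitive dependence on initial conditions".
RULES: List[Tuple[str, str]] = [
    ("Chaotic", "SensitiveToInitialConditions"),
    ("Chaotic", "Deterministic"),
    ("HasPeriodicOrbit", "Bounded"),
]

def implication_consistency(answers: Dict[str, bool]) -> float:
    """Fraction of applicable implication rules respected by a model's
    per-predicate truth assignments for one dynamical system."""
    applicable = [(p, q) for p, q in RULES if p in answers and q in answers]
    if not applicable:
        return 1.0  # vacuously consistent when no rule applies
    satisfied = sum(1 for p, q in applicable if (not answers[p]) or answers[q])
    return satisfied / len(applicable)

if __name__ == "__main__":
    # A model that asserts chaos but denies determinism violates one of three rules.
    model_answers = {
        "Chaotic": True,
        "Deterministic": False,
        "SensitiveToInitialConditions": True,
        "HasPeriodicOrbit": False,
        "Bounded": True,
    }
    print(f"implication consistency: {implication_consistency(model_answers):.2f}")
```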