Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed-set concept vocabularies and simple programs, whereas end-to-end 3D multi-modal LLMs (3D MLLMs) can handle complex natural language and open-vocabulary concepts but reason as black boxes without explicit spatial verification. We introduce APEIRIA, a neuro-symbolic 3D MLLM that bridges the two paradigms by distilling symbolic reasoning patterns into an MLLM as natural-language chain-of-thought (CoT). Our three-stage curriculum progressively builds reasoning capabilities: (a) 3D perception alignment grounds object visual-geometric features to the LLM, (b) CoT supervised fine-tuning (CoT-SFT) teaches query decomposition and stepwise verification from symbolic program traces, and (c) CoT reinforcement learning (CoT-RL) extends these reasoning patterns to open-set concepts and deeply nested instructions. By transferring reasoning patterns rather than concept-specific knowledge, APEIRIA preserves two key NS3D virtues: transparent reasoning and modular interchangeability of planning and perception components. Evaluations on grounding, question answering, and captioning show that APEIRIA surpasses prior NS3D methods and matches state-of-the-art 3D MLLMs on 3D spatial reasoning datasets, combining the systematic reasoning of symbolic methods with the flexibility of MLLMs. Code is available at https://github.com/oceanflowlab/APEIRIA.