Learn2Fold: Structured Origami Generation with World Model Planning

The ability to transform a flat sheet into a complex three-dimensional structure is a fundamental test of physical intelligence. Unlike cloth manipulation, origami is governed by strict geometric axioms and hard kinematic constraints, where a single invalid crease or collision can invalidate the entire folding sequence. As a result, origami demands long-horizon constructive reasoning that jointly satisfies precise physical laws and high-level semantic intent. Existing approaches fall into two disjoint paradigms: optimization-based methods enforce physical validity but require dense, precisely specified inputs, making them unsuitable for sparse natural language descriptions, while generative foundation models excel at semantic and perceptual synthesis yet fail to produce long-horizon, physics-consistent folding processes. Consequently, generating valid origami folding sequences directly from text remains an open challenge. To address this gap, we introduce Learn2Fold, a neuro-symbolic framework that formulates origami folding as conditional program induction over a crease-pattern graph. Our key insight is to decouple semantic proposal from physical verification. A large language model generates candidate folding programs from abstract text prompts, while a learned graph-structured world model serves as a differentiable surrogate simulator that predicts physical feasibility and failure modes before execution. Integrated within a lookahead planning loop, Learn2Fold enables robust generation of physically valid folding sequences for complex and out-of-distribution patterns, demonstrating that effective spatial intelligence arises from the synergy between symbolic reasoning and grounded physical simulation.
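The propose-then-verify loop described above can be illustrated with a minimal sketch. This is not the authors' implementation. The `Fold` fields, the function names, the product-of-feasibilities scoring rule, and the acceptance threshold are all assumptions made for illustration. Two hand-written toy functions stand in for the large language model (the proposer) and for the learned graph-structured world model (the feasibility predictor).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Fold:
    crease: int   # hypothetical: index of a crease in the crease-pattern graph
    angle: float  # hypothetical: target fold angle in degrees


Program = Tuple[Fold, ...]                  # folds applied so far
Proposer = Callable[[Program], List[Fold]]  # stand-in for the LLM
WorldModel = Callable[[Program, Fold], float]  # feasibility score in [0, 1]


def rollout_score(program: Program, fold: Fold, propose: Proposer,
                  feasibility: WorldModel, depth: int) -> float:
    """Feasibility of `fold` times the best score reachable `depth` steps later."""
    score = feasibility(program, fold)
    if depth == 0 or score == 0.0:
        return score
    nxt = program + (fold,)
    candidates = propose(nxt)
    if not candidates:
        return score
    return score * max(
        rollout_score(nxt, c, propose, feasibility, depth - 1) for c in candidates
    )


def plan(propose: Proposer, feasibility: WorldModel, n_steps: int,
         depth: int = 1, threshold: float = 0.5) -> Optional[Program]:
    """Greedy planning with lookahead; returns None if no candidate passes."""
    program: Program = ()
    for step in range(n_steps):
        # Never look past the end of the requested program length.
        d = min(depth, n_steps - step - 1)
        candidates = propose(program)
        if not candidates:
            return None
        scored = [(rollout_score(program, c, propose, feasibility, d), c)
                  for c in candidates]
        best_score, best = max(scored, key=lambda sc: sc[0])
        if best_score < threshold:
            return None  # the verifier rejects every proposal
        program = program + (best,)
    return program


def toy_propose(program: Program) -> List[Fold]:
    # Toy proposer: suggests both creases, including an invalid 200-degree fold.
    return [Fold(c, a) for c in range(2) for a in (200.0, 90.0)]


def toy_feasibility(program: Program, fold: Fold) -> float:
    # Toy verifier with three made-up rules.
    if abs(fold.angle) > 180.0:
        return 0.0  # angle out of range
    if any(f.crease == fold.crease for f in program):
        return 0.0  # crease already folded
    if fold.crease == 1 and any(f.crease == 0 for f in program):
        return 0.0  # folding crease 0 first blocks crease 1
    return 1.0


print(plan(toy_propose, toy_feasibility, n_steps=2, depth=1))
print(plan(toy_propose, toy_feasibility, n_steps=2, depth=0))
```

In this toy setting, one step of lookahead lets the planner fold crease 1 before crease 0, while the purely greedy variant (`depth=0`) commits to crease 0 first and then finds no feasible continuation.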
