Building AI systems that can plan, act, and create in the physical world requires more than pattern recognition. Such systems must understand the causal mechanisms and constraints governing physical processes in order to guide sequential decisions. This capability relies on internal representations, analogous to an internal language model, that relate observations, actions, and resulting environmental changes. However, many existing benchmarks treat visual perception and programmatic reasoning as separate problems, focusing either on visual recognition or on symbolic tasks. The domain of origami provides a natural testbed that integrates these modalities. Constructing shapes through folding operations requires visual perception, reasoning about geometric and physical constraints, and sequential planning, while remaining sufficiently structured for systematic evaluation. We introduce OrigamiBench, an interactive benchmark in which models iteratively propose folds and receive feedback on physical validity and similarity to a target configuration. Experiments with modern vision-language models show that scaling model size alone does not reliably produce causal reasoning about physical transformations. Models fail to generate coherent multi-step folding strategies, suggesting that visual and language representations remain weakly integrated.
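The interaction protocol described above (a model iteratively proposes folds and receives validity and similarity feedback) can be sketched as a simple environment loop. This is a minimal illustrative sketch only: all names (`OrigamiEnv`, `Feedback`, `run_episode`) and the toy fold-counting state are assumptions, not the benchmark's actual API or scoring rule.

```python
# Hypothetical sketch of an OrigamiBench-style interaction loop.
# Class and function names are illustrative assumptions, not the real API.
from dataclasses import dataclass

@dataclass
class Feedback:
    valid: bool        # did the proposed fold respect physical constraints?
    similarity: float  # similarity of current state to the target, in [0, 1]

class OrigamiEnv:
    """Toy environment: state is a fold count; target is a fixed fold count."""
    def __init__(self, target_folds: int = 3):
        self.target_folds = target_folds
        self.folds = 0

    def step(self, fold) -> Feedback:
        # A real benchmark would check geometric/physical validity of the
        # proposed fold; here any non-empty fold description counts as valid.
        valid = bool(fold)
        if valid:
            self.folds += 1
        similarity = min(self.folds, self.target_folds) / self.target_folds
        return Feedback(valid=valid, similarity=similarity)

def run_episode(env: OrigamiEnv, policy, max_steps: int = 10) -> float:
    """Iteratively propose folds until the target is matched or steps run out."""
    feedback = None
    for _ in range(max_steps):
        fold = policy(feedback)    # model proposes the next fold, given feedback
        feedback = env.step(fold)  # environment returns validity + similarity
        if feedback.similarity >= 1.0:
            break
    return feedback.similarity
```

In the actual benchmark the `policy` would wrap a vision-language model conditioned on an image of the current paper state; here a trivial policy that always proposes a fold suffices to exercise the loop.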