As robots are expected to perform increasingly diverse tasks, they must understand not only low-level actions but also the higher-level structure that determines how a task should unfold. Existing vision-language-action (VLA) models struggle with this form of task-level reasoning. They rely either on prompt-based in-context decomposition, which is unstable and sensitive to linguistic variation, or on end-to-end long-horizon training, which requires large-scale demonstrations and entangles task-level reasoning with low-level control. We present in-parameter structured task reasoning (iSTAR), a framework that enhances VLA models through functional differentiation induced by in-parameter structural reasoning. Instead of treating a VLA as a monolithic policy, iSTAR embeds task-level semantic structure directly into the model parameters, enabling differentiated task-level inference without external planners or handcrafted prompt inputs. The injected structure takes the form of implicit dynamic scene-graph knowledge that captures object relations, subtask semantics, and task-level dependencies in parameter space. Across diverse manipulation benchmarks, iSTAR achieves more reliable task decompositions and higher success rates than both in-context and end-to-end VLA baselines, demonstrating that parameter-space structural reasoning is effective for functional differentiation and improves generalization across task variations.
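To make concrete the kind of knowledge the abstract refers to (object relations, subtask semantics, and task-level dependencies), the following is a minimal illustrative sketch of an explicit dynamic scene graph. It is not the iSTAR method: the abstract states that iSTAR holds this structure implicitly in model parameters, whereas this sketch writes it out as a data structure purely for illustration. The class `SceneGraph`, its methods, and the example task are all hypothetical names assumed here.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class SceneGraph:
    """Hypothetical explicit stand-in for scene-graph knowledge (illustration only)."""

    # Object relations stored as (subject, predicate, object) triples.
    relations: set = field(default_factory=set)
    # Maps each subtask name to the set of subtasks that must finish first.
    dependencies: dict = field(default_factory=dict)

    def relate(self, subject, predicate, obj):
        self.relations.add((subject, predicate, obj))

    def unrelate(self, subject, predicate, obj):
        self.relations.discard((subject, predicate, obj))

    def add_subtask(self, name, after=()):
        self.dependencies[name] = set(after)

    def decompose(self):
        """Return one subtask order that respects every dependency."""
        return list(TopologicalSorter(self.dependencies).static_order())


graph = SceneGraph()
graph.relate("cup", "on", "table")
graph.add_subtask("open drawer")
graph.add_subtask("pick up cup", after=["open drawer"])
graph.add_subtask("place cup in drawer", after=["pick up cup"])
graph.add_subtask("close drawer", after=["place cup in drawer"])

plan = graph.decompose()

# The graph is dynamic: carrying out a subtask changes the object relations.
graph.unrelate("cup", "on", "table")
graph.relate("cup", "in", "drawer")
```

In this sketch, task decomposition reduces to a topological ordering of the subtask dependencies, and the relation updates show what "dynamic" means for such a graph. How iSTAR encodes or queries the equivalent knowledge in parameter space is not specified in the abstract and is not modeled here.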