Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control. We present DynamicVLA, a framework for dynamic object manipulation that integrates temporal reasoning and closed-loop adaptation through three key designs: 1) a compact 0.4B-parameter VLA with a convolutional vision encoder for spatially efficient, structurally faithful encoding, enabling fast multimodal inference; 2) Continuous Inference, which overlaps reasoning with execution for lower latency and timely adaptation to object motion; and 3) Latent-aware Action Streaming, which bridges the perception-execution gap by enforcing temporally aligned action execution. To address the lack of foundational data for dynamic manipulation, we introduce the Dynamic Object Manipulation (DOM) benchmark, built from scratch with an automated data collection pipeline that efficiently gathers 200K synthetic episodes across 2.8K scenes and 206 objects, and enables fast collection of 2K real-world episodes without teleoperation. Extensive evaluations demonstrate substantial improvements in response speed, perception, and generalization, positioning DynamicVLA as a unified framework for general dynamic object manipulation across embodiments.
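The Continuous Inference idea above can be illustrated with a minimal sketch: the policy computes the next action chunk in a background thread while the robot executes the current one, so inference latency overlaps with motion instead of stalling the control loop. All names here (`infer_chunk`, `control_loop`) are hypothetical stand-ins, not the paper's actual implementation.

```python
import threading
import time
from queue import Queue

def infer_chunk(obs):
    # Stand-in for a VLA forward pass; returns a short action chunk.
    time.sleep(0.05)  # simulated inference latency
    return [obs + i for i in range(4)]

def control_loop(observations):
    """Overlap inference with execution: while one chunk 'executes',
    the chunk for the next observation is computed concurrently."""
    executed = []
    chunk_queue = Queue(maxsize=1)

    def worker(obs):
        chunk_queue.put(infer_chunk(obs))

    # Prime the pipeline with the first observation.
    threading.Thread(target=worker, args=(observations[0],)).start()
    for obs in observations[1:]:
        chunk = chunk_queue.get()  # chunk for the previous obs is ready
        # Launch inference for the new observation before executing,
        # so reasoning and execution proceed in parallel.
        threading.Thread(target=worker, args=(obs,)).start()
        executed.extend(chunk)     # "execute" while inference runs
    executed.extend(chunk_queue.get())  # drain the final chunk
    return executed
```

In a blocking (non-overlapped) loop, each cycle pays the full inference latency before any motion; here the latency is hidden behind execution of the previous chunk, which is what allows timely adaptation to a moving object.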