Vision-Language-Action (VLA) models excel at static manipulation but struggle in dynamic environments with moving targets. This performance gap stems primarily from the scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, which restricts their spatiotemporal reasoning. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks of hierarchical complexity, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow with specialized world queries that implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.
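To make the fusion idea concrete, the sketch below illustrates one possible way to combine current-frame visual tokens, scene-centric historical optical-flow features, and learnable "world queries" in a single transformer encoder, with actions read out from the query slots. This is a minimal, hypothetical sketch assuming pre-extracted features; the module name `DynamicsAwareFusion`, the dimensions, and the fusion scheme are illustrative and are not the authors' implementation.

```python
# Hypothetical sketch of a dynamics-aware fusion module, NOT the PUMA code.
# Assumes per-frame visual features and scene-centric optical-flow vectors
# have already been extracted upstream.
import torch
import torch.nn as nn


class DynamicsAwareFusion(nn.Module):
    def __init__(self, dim=256, n_world_queries=8, n_heads=8, n_layers=2, action_dim=7):
        super().__init__()
        # Learnable world queries intended to implicitly absorb short-horizon,
        # object-centric future-state information during training.
        self.world_queries = nn.Parameter(torch.randn(n_world_queries, dim))
        self.flow_proj = nn.Linear(2, dim)  # project (dx, dy) flow vectors into token space
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, visual_tokens, flow_tokens):
        # visual_tokens: (B, N_v, dim) current-frame visual features
        # flow_tokens:   (B, N_f, 2)   historical optical-flow vectors (scene-centric)
        B = visual_tokens.size(0)
        flow_feat = self.flow_proj(flow_tokens)                      # (B, N_f, dim)
        queries = self.world_queries.unsqueeze(0).expand(B, -1, -1)  # (B, Q, dim)
        tokens = torch.cat([visual_tokens, flow_feat, queries], dim=1)
        fused = self.encoder(tokens)
        # Read out actions from the world-query slots (the last Q tokens).
        query_out = fused[:, -self.world_queries.size(0):, :].mean(dim=1)
        return self.action_head(query_out)                           # (B, action_dim)


if __name__ == "__main__":
    model = DynamicsAwareFusion()
    vis = torch.randn(2, 64, 256)   # e.g. 8x8 patch features from the current frame
    flow = torch.randn(2, 64, 2)    # downsampled flow field aggregated over past frames
    print(model(vis, flow).shape)   # torch.Size([2, 7])
```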