Vision-Language-Action (VLA) models are receiving increasing attention for their ability to enable robots to perform complex tasks by integrating visual context with linguistic commands. However, achieving efficient real-time performance remains challenging due to the high computational demands of existing models. To overcome this, we propose Dual Process VLA (DP-VLA), a hierarchical framework inspired by dual-process theory. DP-VLA utilizes a Large System 2 Model (L-Sys2) for complex reasoning and decision-making, while a Small System 1 Model (S-Sys1) handles real-time motor control and sensory processing. By leveraging Vision-Language Models (VLMs), the L-Sys2 operates at low frequencies, reducing computational overhead, while the S-Sys1 ensures fast and accurate task execution. Experimental results on the RoboCasa dataset demonstrate that DP-VLA achieves faster inference and higher task success rates, providing a scalable solution for advanced robotic applications.
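The dual-frequency division of labor described above can be sketched as a simple control loop: a slow module refreshes a task representation only every few steps, while a fast module emits an action at every step. This is a minimal illustrative sketch only; the class names `LSys2`/`SSys1`, the 10:1 frequency ratio, and the 7-DoF action placeholder are assumptions for illustration, not details from the paper.

```python
class LSys2:
    """Slow reasoning module (hypothetical): maps (observation, instruction)
    to a task latent at low frequency, standing in for a large VLM."""
    def infer(self, observation, instruction):
        # Placeholder latent; a real system would run VLM inference here.
        return {"goal": instruction, "context": observation}


class SSys1:
    """Fast control module (hypothetical): maps (observation, task latent)
    to a motor action at every control step."""
    def act(self, observation, latent):
        return [0.0] * 7  # placeholder 7-DoF action vector


def control_loop(steps=30, slow_period=10):
    """Run `steps` control ticks; invoke the slow module once per
    `slow_period` ticks and the fast module on every tick."""
    l_sys2, s_sys1 = LSys2(), SSys1()
    latent = None
    actions = []
    slow_calls = 0
    for t in range(steps):
        obs = f"obs_{t}"
        if t % slow_period == 0:  # low-frequency reasoning
            latent = l_sys2.infer(obs, "pick up the mug")
            slow_calls += 1
        actions.append(s_sys1.act(obs, latent))  # every-step control
    return actions, slow_calls


actions, slow_calls = control_loop()
print(len(actions), slow_calls)  # 30 actions, only 3 slow-model calls
```

The point of the sketch is the asymmetry: the expensive reasoning call amortizes over many cheap control steps, which is the mechanism by which the hierarchical design reduces per-step compute.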