In humanoid robot control, fusing Vision-Language-Action (VLA) models with whole-body control is essential for semantically guided execution of real-world tasks. Existing methods, however, suffer from low VLA inference efficiency or lack effective semantic guidance for whole-body control, leading to instability in dynamic limb-coordination tasks. To bridge this gap, we present a semantic-motion-intent-guided, physics-aware, multi-brain VLA framework for humanoid whole-body control. A series of experiments evaluated the proposed framework, and the results demonstrate that it enables reliable vision-language-guided whole-body coordination for humanoid robots.