Discovering effective reward functions remains a fundamental challenge in the motor control of high-dimensional musculoskeletal systems. While humans can describe movement goals explicitly, such as "walking forward with an upright posture," the control strategies that realize these goals are largely implicit, making it difficult to design rewards directly from high-level goals and natural-language descriptions. We introduce Motion from Vision-Language Representation (MoVLR), a framework that leverages vision-language models (VLMs) to bridge the gap between goal specification and movement control. Rather than relying on handcrafted rewards, MoVLR explores the reward space through iterative interaction between control optimization and VLM feedback, aligning control policies with physically coordinated behaviors. Our approach transforms language- and vision-based assessments into structured guidance for embodied learning, enabling the discovery and refinement of reward functions for high-dimensional musculoskeletal locomotion and manipulation. Our results suggest that VLMs can effectively ground abstract motion descriptions in the implicit principles governing physiological motor control.
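To make the described optimize-evaluate-refine cycle concrete, the following is a minimal Python sketch of the kind of loop the abstract outlines. It is an illustration under our own assumptions, not the paper's implementation: every name here (propose_reward, train_policy, rollout, vlm_feedback) is a hypothetical placeholder, since this section does not specify the actual interfaces.

```python
"""Hypothetical sketch of a VLM-in-the-loop reward-discovery cycle.
All functions below are illustrative stubs, not MoVLR's actual API."""

def propose_reward(goal: str, feedback: str | None) -> str:
    # Placeholder: a VLM/LLM drafts or refines a reward function
    # from the high-level goal and the previous round's critique.
    return f"reward for '{goal}'" + (f", revised per: {feedback}" if feedback else "")

def train_policy(reward_fn: str) -> str:
    # Placeholder: control optimization (e.g., RL) on the
    # high-dimensional musculoskeletal model under this reward.
    return f"policy trained under ({reward_fn})"

def rollout(policy: str) -> str:
    # Placeholder: render the learned behavior for visual assessment.
    return f"video of {policy}"

def vlm_feedback(video: str, goal: str) -> str:
    # Placeholder: the VLM assesses how well the observed motion
    # matches the language-specified goal and returns a critique.
    return f"critique of {video} against '{goal}'"

def movlr_loop(goal: str, n_iter: int = 3):
    feedback = None
    for _ in range(n_iter):
        reward_fn = propose_reward(goal, feedback)   # draft/refine reward
        policy = train_policy(reward_fn)             # optimize the controller
        feedback = vlm_feedback(rollout(policy), goal)  # close the loop
    return policy, reward_fn

policy, reward = movlr_loop("walk forward with an upright posture")
```

The point of the sketch is the structure, not the stubs: reward design is treated as a search problem in which the VLM's language- and vision-based assessments steer successive reward proposals, rather than a reward being handcrafted once up front.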