Achieving robot transparency is a critical step toward effective human-robot collaboration. To be transparent, a robot's natural language communication must be consistent with its actions and explicitly grounded in the task and environment. Existing hierarchical Vision-Language-Action (VLA) models can generate both language (e.g., through chain-of-thought) and low-level actions. However, current work does not explicitly align these two modalities during training. To address this gap, we propose a training framework that explicitly grounds the sub-task descriptions of a hierarchical VLA in the visual observation and action space. Our framework uses a contrastive model to assess the alignment between generated language and the corresponding action trajectories. This contrastive model enables direct ranking of language-trajectory pairs by their alignment, allowing us to refine the grounding of our hierarchical VLA through offline preference learning. We apply our framework to the LanguageTable dataset, a benchmark of human language-annotated trajectories, and provide insights into multimodal grounding representations, while establishing a strong baseline that achieves performance comparable to fully supervised fine-tuning and minimizes the need for costly data annotations.
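To make the ranking-then-preference idea concrete, the following is a minimal sketch rather than the framework's actual implementation. It assumes that language and trajectory embeddings are already available from some contrastive encoder, scores candidates by cosine similarity, and forms one (chosen, rejected) pair per trajectory. The abstract does not name the preference objective, so a DPO-style loss is swapped in here purely for illustration; the function names, the `beta` value, and the toy embeddings are all hypothetical.

```python
import numpy as np


def cosine_scores(lang_emb, traj_emb):
    """Cosine similarity between each candidate language embedding (rows)
    and a single trajectory embedding. Embeddings are assumed to come from
    a pretrained contrastive model (hypothetical here)."""
    lang = lang_emb / np.linalg.norm(lang_emb, axis=1, keepdims=True)
    traj = traj_emb / np.linalg.norm(traj_emb)
    return lang @ traj


def build_preference_pair(candidates, scores):
    """Rank candidates by alignment score and return (chosen, rejected):
    the best-aligned and worst-aligned descriptions."""
    order = np.argsort(scores)  # ascending by score
    return candidates[order[-1]], candidates[order[0]]


def dpo_style_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style objective (an assumption, not confirmed by the source):
    -log sigmoid(beta * margin), where the margin compares policy and
    reference log-probabilities of the chosen and rejected descriptions."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp
    return float(np.logaddexp(0.0, -beta * margin))


if __name__ == "__main__":
    # Toy 2-D embeddings, invented for illustration only.
    candidates = ["push the red block left",
                  "push the blue block right",
                  "do nothing"]
    lang_emb = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
    traj_emb = np.array([0.9, 0.1])

    scores = cosine_scores(lang_emb, traj_emb)
    chosen, rejected = build_preference_pair(candidates, scores)
    print("chosen:", chosen)
    print("rejected:", rejected)
    print("loss:", dpo_style_loss(-1.0, -5.0, -2.0, -2.0))
```

Because the pairs are produced by scoring already-generated descriptions against logged trajectories, this kind of pipeline needs no additional human labels, which is consistent with the abstract's stated aim of reducing annotation cost.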