Vision-language-action models have gained significant attention for their ability to model multimodal sequences in embodied instruction following tasks. However, most existing models rely on causal attention, which we find suboptimal for processing sequences composed of interleaved segments from different modalities. In this paper, we introduce Astra, a novel Transformer architecture featuring trajectory attention and learnable action queries, designed to efficiently process segmented multimodal trajectories and predict actions for imitation learning. Furthermore, we propose a contrastive dynamics learning objective to enhance the model's understanding of environment dynamics and multimodal alignment, complementing the primary behavior cloning objective. Through extensive experiments on three large-scale robot manipulation benchmarks, Astra demonstrates substantial performance improvements over previous models.
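The abstract contrasts plain causal attention with a trajectory attention that respects segment boundaries in an interleaved multimodal sequence. As a minimal sketch of one plausible reading (the paper's exact masking scheme is not given here, so the segment-wise rule below — bidirectional attention within a segment, causal attention across segments — and the learnable action queries that cross-attend to the trajectory are illustrative assumptions), in plain NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def trajectory_attention_mask(segment_ids):
    """Hypothetical trajectory mask: a token attends bidirectionally within
    its own segment and causally to all earlier segments (not to later ones)."""
    seg = np.asarray(segment_ids)
    # position (i, j) is allowed iff key j's segment precedes or equals query i's
    return seg[None, :] <= seg[:, None]

def attention(Q, K, V, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)          # block disallowed positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Toy interleaved trajectory: two timesteps, each a segment of three tokens
# (e.g. image patch, language token, proprioception token -- all placeholders).
segment_ids = [0, 0, 0, 1, 1, 1]
d_model = 8
tokens = rng.standard_normal((6, d_model))          # stand-in token embeddings

mask = trajectory_attention_mask(segment_ids)       # (6, 6) token-token mask
out = attention(tokens, tokens, tokens, mask)       # contextualized trajectory

# Learnable action queries (random stand-ins for trained parameters) read the
# trajectory via cross-attention; query t sees segments up to and including t.
action_queries = rng.standard_normal((2, d_model))
query_mask = np.array([[True] * 3 + [False] * 3,    # query 0: segment 0 only
                       [True] * 6])                 # query 1: both segments
actions = attention(action_queries, out, out, query_mask)
print(out.shape, actions.shape)                     # (6, 8) (2, 8)
```

The point of the segment-wise rule is that tokens of one observation (which arrive together, not autoregressively) can see each other fully, while temporal order across timesteps is still respected; a vanilla causal mask would instead impose an arbitrary order inside each observation.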