In robot learning, Vision Transformers (ViTs) are the standard backbone for visual perception, yet most methods discard valuable information by using only the final layer's features. We argue that this yields an insufficient representation and propose the Vision Action Transformer (VAT), a novel architecture that extends the ViT and unlocks its full feature hierarchy. VAT processes specialized action tokens alongside visual features across all transformer layers, enabling a deep, progressive fusion of perception and action generation. On a suite of simulated manipulation tasks, VAT achieves a 98.15\% average success rate across four LIBERO benchmarks, establishing a new state of the art and outperforming prior methods such as OpenVLA-OFT. Our work not only presents a powerful model for imitation learning but also demonstrates the critical importance of leveraging the complete ``representation trajectory'' of vision models to advance robotic policies. Code is available at https://github.com/sellerbubble/VAT.
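The core idea of propagating action tokens through every transformer layer, rather than reading features off only the final layer, can be illustrated with a minimal sketch. This is not the authors' implementation: the layer shapes, the single-head attention, the random weights, and the 7-dimensional action head are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(tokens, W_qkv, W_o):
    """Single-head self-attention with a residual connection (norms/MLP omitted)."""
    q, k, v = np.split(tokens @ W_qkv, 3, axis=-1)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + (attn @ v) @ W_o

rng = np.random.default_rng(0)
d, n_patches, n_action, n_layers = 32, 16, 4, 6  # illustrative sizes

patch_tokens = rng.normal(size=(n_patches, d))         # visual (patch) features
action_tokens = rng.normal(size=(n_action, d)) * 0.02  # learnable action queries

# Action tokens ride through every layer alongside patch tokens, so
# perception/action fusion happens at all depths of the ViT hierarchy,
# not only on the final layer's output.
tokens = np.concatenate([patch_tokens, action_tokens], axis=0)
for _ in range(n_layers):
    W_qkv = rng.normal(size=(d, 3 * d)) / np.sqrt(d)
    W_o = rng.normal(size=(d, d)) / np.sqrt(d)
    tokens = attention_layer(tokens, W_qkv, W_o)

# A hypothetical action head maps the final action-token states to actions,
# e.g. a 7-DoF end-effector command.
actions = tokens[-n_action:] @ rng.normal(size=(d, 7))
print(actions.shape)  # (4, 7)
```

In a trained model the action tokens would be learned parameters and each layer would carry its own weights, layer norms, and MLP; the point of the sketch is only the token routing, i.e. that the action queries attend to visual features at every depth.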