Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective remedy is representation alignment, where a strong vision foundation model guides a 2D VLA model. However, existing methods usually apply supervision at only a single layer, failing to fully exploit the rich information distributed across depth; meanwhile, naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses showing that a shared projector is sufficient and outperforms prior designs, and we further propose a Matryoshka-style sparse activation scheme for the shared projector to balance the multiple alignment losses. Our experiments show that, combined with a training-free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving a state-of-the-art 98.5% success rate on LIBERO. We further demonstrate the superior performance of ROCKET on LIBERO-Plus and RoboTwin, as well as across multiple VLA models. The code and model weights can be found at https://github.com/CASE-Lab-UMD/ROCKET-VLA.
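The core idea, one shared projector applied identically at several VLA layers to match features from a 3D teacher, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the hidden sizes, the aligned layer indices, and the cosine-based alignment loss are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

d_vla, d_3d = 64, 32   # hidden sizes of the VLA backbone and 3D teacher (hypothetical)
layers = [4, 8, 12]    # layer indices chosen for alignment (hypothetical)

# A single shared projector W is used at EVERY aligned layer (the
# layer-invariant mapping); a Matryoshka-style scheme would additionally
# activate nested sub-blocks of W per layer, omitted here for brevity.
W = rng.normal(scale=0.02, size=(d_vla, d_3d))

def cosine_align_loss(h_vla, h_3d, W):
    """1 - cosine similarity between projected VLA features and teacher features."""
    z = h_vla @ W
    num = np.sum(z * h_3d, axis=-1)
    den = np.linalg.norm(z, axis=-1) * np.linalg.norm(h_3d, axis=-1) + 1e-8
    return float(np.mean(1.0 - num / den))

# Toy per-layer features for a batch of 5 tokens.
vla_feats = {l: rng.normal(size=(5, d_vla)) for l in layers}
teacher_feats = {l: rng.normal(size=(5, d_3d)) for l in layers}

# Total alignment loss: averaged over aligned layers, all flowing through the
# SAME projector parameters, so per-layer gradients update one shared mapping
# rather than conflicting across separate per-layer projectors.
total_loss = float(np.mean([
    cosine_align_loss(vla_feats[l], teacher_feats[l], W) for l in layers
]))
```

Because 1 minus a cosine similarity lies in [0, 2], the averaged loss stays in that range regardless of the random features; in training, this term would be added to the action-prediction objective.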