Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% to 97.1%). Code and visualizations are available at: jongwoopark7978.github.io/IVRA
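To make the idea of injecting encoder-derived affinity into a language-model layer concrete, the following is a minimal NumPy sketch under stated assumptions. It is not the IVRA implementation: the abstract does not specify how affinities are computed or injected, so the cosine-similarity affinity, the softmax `temperature`, the blend weight `alpha`, and the function names `patch_affinity` and `inject_affinity` are all hypothetical choices made for illustration.

```python
import numpy as np


def patch_affinity(patch_feats, temperature=0.1, eps=1e-8):
    """Row-stochastic patch-to-patch affinity (ASSUMED form: softmax over cosine similarity).

    patch_feats: (num_patches, dim) features from the vision encoder.
    """
    norms = np.linalg.norm(patch_feats, axis=1, keepdims=True)
    normed = patch_feats / (norms + eps)
    logits = (normed @ normed.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def inject_affinity(hidden, visual_slice, affinity, alpha=0.5):
    """Blend visual-token hidden states with their affinity-weighted neighbors.

    hidden: (seq_len, dim) hidden states at one language-model layer.
    visual_slice: positions of the visual tokens in the sequence.
    alpha: ASSUMED blend weight; 0 leaves the hidden states unchanged.
    No parameters are trained or modified; only activations are mixed.
    """
    out = hidden.copy()
    vis = hidden[visual_slice]
    out[visual_slice] = (1.0 - alpha) * vis + alpha * (affinity @ vis)
    return out


# Toy example: 16 visual tokens (a 4x4 patch grid) followed by 4 text tokens.
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(16, 32))   # stand-in for vision-encoder features
hidden = rng.normal(size=(20, 64))        # stand-in for LM hidden states
affinity = patch_affinity(patch_feats)
mixed = inject_affinity(hidden, slice(0, 16), affinity)
```

In this sketch only the visual-token positions are modified and the text tokens pass through untouched, which mirrors the abstract's description of an inference-time intervention that keeps all model parameters fixed.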