Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% to 97.1%). All code and models will be released publicly. Visualizations are available at: jongwoopark7978.github.io/IVRA
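The core idea, blending the language model's visual-token attention with affinity signals from the built-in vision encoder at inference time, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the cosine-similarity affinity, the blend weight `alpha`, and the single-block injection are all assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_affinity(feats):
    """Cosine-similarity affinity between visual patch features (N, D),
    standing in for the affinity hints of the model's vision encoder."""
    f = feats / np.linalg.norm(feats, axis=-1, keepdims=True)
    return f @ f.T  # (N, N) pairwise affinities

def inject_affinity(attn, affinity, vis, alpha=0.5):
    """Blend the visual-token block of an attention map (T, T) with
    encoder affinities, then renormalize rows. All parameters stay
    fixed; this is a pure inference-time edit. `alpha` is hypothetical."""
    out = attn.copy()
    out[vis, vis] = (1 - alpha) * attn[vis, vis] + alpha * softmax(affinity)
    return out / out.sum(axis=-1, keepdims=True)
```

For example, with 8 total tokens of which the first 4 are visual, `inject_affinity(attn, patch_affinity(feats), slice(0, 4))` returns an attention map whose rows still sum to one but whose visual-token interactions are realigned toward the encoder's geometry.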