Vision-Language-Action (VLA) models have recently made remarkable progress on embodied tasks, but most methods process the visual observation at each timestep independently. This history-agnostic design treats robot manipulation as a Markov Decision Process (MDP), even though real-world robotic control is inherently partially observable and requires reasoning over past interactions. To address this mismatch, we reformulate VLA policy learning from a Partially Observable Markov Decision Process (POMDP) perspective and propose AVA-VLA, a framework that conditions action generation on a recurrent state serving as a neural approximation of the agent's belief over the task history. Building on this recurrent state, we introduce Active Visual Attention (AVA), which dynamically reweights the visual tokens of the current observation to emphasize the regions most relevant to both the instruction and the execution history. Extensive experiments show that AVA-VLA achieves state-of-the-art performance on standard robotic benchmarks, including LIBERO and CALVIN, and transfers effectively to real-world dual-arm manipulation tasks. These results demonstrate that temporally grounded active visual processing improves VLA performance in sequential robotic decision-making. The project page is available at https://liauto-dsr.github.io/AVA-VLA-Page.
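The idea of reweighting visual tokens with a recurrent, history-dependent state can be illustrated with a toy sketch. This is not the AVA-VLA implementation: the function names (`reweight_tokens`, `update_state`, `rollout`), the additive query, the softmax scoring, and the tanh state update are all assumptions chosen only to make the mechanism concrete in plain Python.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reweight_tokens(tokens, state, instruction):
    """Score each visual token against a query built from the recurrent
    state and the instruction embedding (hypothetical additive fusion),
    then rescale the tokens by their softmax weights."""
    d = len(state)
    query = [s + i for s, i in zip(state, instruction)]
    scores = [dot(tok, query) / math.sqrt(d) for tok in tokens]
    weights = softmax(scores)
    n = len(tokens)
    # Multiply by n so that uniform weights leave the tokens unchanged.
    reweighted = [[n * w * x for x in tok] for tok, w in zip(tokens, weights)]
    return weights, reweighted


def update_state(state, tokens, weights, decay=0.5):
    """Toy recurrent update (an assumption, not the paper's): blend the
    previous state with the attention-pooled tokens, then squash."""
    pooled = [
        sum(w * tok[k] for tok, w in zip(tokens, weights))
        for k in range(len(state))
    ]
    return [math.tanh(decay * s + (1 - decay) * p) for s, p in zip(state, pooled)]


def rollout(observations, instruction, dim):
    """Run the reweighting over a sequence of observations, carrying the
    recurrent state across timesteps. Returns the final state and the
    per-timestep token weights."""
    state = [0.0] * dim
    history = []
    for tokens in observations:
        weights, _ = reweight_tokens(tokens, state, instruction)
        history.append(weights)
        state = update_state(state, tokens, weights)
    return state, history
```

In this sketch, feeding the same observation twice yields different token weights at the second timestep, because the recurrent state has absorbed the first observation. A history-agnostic policy would assign identical weights both times.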