In robotics, Vision-Language-Action (VLA) models that integrate diverse multimodal signals from multi-view inputs have emerged as an effective approach. However, most prior work adopts static fusion, processing all visual inputs uniformly, which incurs unnecessary computational overhead and allows task-irrelevant background information to act as noise. Inspired by the principles of human active perception, we propose a dynamic information fusion framework designed to maximize the efficiency and robustness of VLA models. Our approach introduces a lightweight adaptive routing architecture that analyzes the current text prompt and observations from a wrist-mounted camera in real time to predict the task relevance of multiple camera views. By conditionally attenuating computation for views with low informational utility and selectively providing only essential visual features to the policy network, our framework achieves computational cost proportional to task relevance. Furthermore, to efficiently obtain large-scale annotations for router training, we established an automated labeling pipeline that uses Vision-Language Models (VLMs) to minimize data collection and annotation costs. Experimental results in real-world robotic manipulation scenarios demonstrate that the proposed approach achieves significant improvements in both inference efficiency and control performance over existing VLA models, validating the effectiveness and practicality of dynamic information fusion in resource-constrained, real-time robot control environments.
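To make the routing idea concrete, the following is a minimal, hypothetical sketch of the adaptive view router described above: a lightweight linear scorer maps a fused text-plus-wrist-camera feature vector to a per-view relevance score, and views scoring below a threshold are skipped so that expensive visual encoding is only paid for task-relevant views. All names, dimensions, weights, and the threshold here are illustrative assumptions, not the paper's actual architecture.

```python
import math

def sigmoid(x):
    """Squash a logit into a (0, 1) relevance score."""
    return 1.0 / (1.0 + math.exp(-x))

def route_views(fused_feat, view_scorers, threshold=0.5):
    """Score each candidate camera view and pick which ones to encode.

    fused_feat:   list[float], concatenated text-prompt and wrist-camera
                  features (illustrative stand-in for learned embeddings)
    view_scorers: dict mapping view name -> (weights, bias), one tiny
                  linear scorer per candidate external view
    threshold:    views scoring below this are attenuated (skipped), so
                  compute scales with predicted task relevance
    """
    scores = {}
    for view, (w, b) in view_scorers.items():
        logit = sum(wi * xi for wi, xi in zip(w, fused_feat)) + b
        scores[view] = sigmoid(logit)
    active = [v for v, s in scores.items() if s >= threshold]
    return scores, active

# Toy example with made-up features and weights: one external view is
# predicted relevant for this input, the other is gated off.
fused = [0.8, -0.2, 1.0]
scorers = {
    "overhead": ([1.0, 0.5, 0.2], 0.0),
    "side":     ([-1.0, -1.0, -1.0], 0.0),
}
scores, active = route_views(fused, scorers)
```

In a real system the per-view scorers would be trained on the VLM-generated relevance labels, and "skipping" a view would mean bypassing its vision encoder rather than merely dropping its features.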