Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-Aware View Planning (TAVP), a framework designed to overcome these challenges by integrating active view planning with task-specific representation learning. TAVP employs an efficient exploration policy, accelerated by a novel pseudo-environment, to actively acquire informative views. Furthermore, we introduce a Mixture-of-Experts (MoE) visual encoder to disentangle features across different tasks, improving both representation fidelity and task generalization. By learning to see the world in a task-aware way, TAVP generates more complete and discriminative visual representations, yielding significantly more accurate action prediction across a wide array of manipulation challenges. Extensive experiments on RLBench tasks show that the proposed TAVP model outperforms state-of-the-art fixed-view approaches. Visual results and code are available at: https://hcplab-sysu.github.io/TAVP.