Recent vision-language-action (VLA) models for multi-task robot manipulation often rely on fixed camera setups and shared visual encoders, which limit their performance under occlusions and during cross-task transfer. To address these challenges, we propose Task-aware Virtual View Exploration (TVVE), a framework that learns to select task-relevant virtual camera viewpoints and to dynamically re-render observations from those viewpoints using a reconstructed scene representation. To enable efficient view selection, we train an exploration policy in a pseudo-environment. In addition, we introduce a Task-aware Mixture-of-Experts (TaskMoE) visual encoder that routes visual features to task-specialized experts, mitigating interference in multi-task learning. To evaluate robustness under distribution shifts, we construct RLBench-OG, an out-of-distribution benchmark with visual perturbations and camera pose variations. Experiments on RLBench and RLBench-OG demonstrate that TVVE achieves higher success rates than strong baselines, while real-robot experiments further confirm its robustness to visual disturbances and unseen instructions. Code and visualizations are available at: https://hcplab-sysu.github.io/TAVP.
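To make the TaskMoE idea concrete, the following is a minimal sketch of a task-aware mixture-of-experts routing layer. It assumes only what the abstract states (a router that sends visual features to task-specialized experts, conditioned on the task); the expert count, feature dimensions, top-k gating, and all names (`TaskMoESketch`, `feat_dim`, `task_dim`) are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch of task-aware MoE routing (assumed design, not the paper's code):
# a router conditioned on a task embedding assigns each visual token to a few
# task-specialized expert networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskMoESketch(nn.Module):
    def __init__(self, feat_dim=256, task_dim=64, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Small per-expert MLPs; real visual experts would be larger encoders.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU(),
                          nn.Linear(feat_dim, feat_dim))
            for _ in range(num_experts)
        ])
        # Router scores experts from visual features concatenated with the task embedding.
        self.router = nn.Linear(feat_dim + task_dim, num_experts)

    def forward(self, visual_feats, task_emb):
        # visual_feats: (B, N, feat_dim) visual tokens; task_emb: (B, task_dim)
        B, N, _ = visual_feats.shape
        task = task_emb.unsqueeze(1).expand(-1, N, -1)
        logits = self.router(torch.cat([visual_feats, task], dim=-1))   # (B, N, E)
        weights = F.softmax(logits, dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)                   # keep top-k experts per token
        topw = topw / topw.sum(dim=-1, keepdim=True)                    # renormalize gate weights
        out = torch.zeros_like(visual_feats)
        for e, expert in enumerate(self.experts):
            # per-token weight for expert e (zero where e is not among the top-k)
            w = torch.where(topi == e, topw, torch.zeros_like(topw)).sum(dim=-1)
            if w.any():
                out = out + w.unsqueeze(-1) * expert(visual_feats)
        return out

# Example: route 196 visual tokens for a batch of 2 task-conditioned observations.
if __name__ == "__main__":
    moe = TaskMoESketch()
    feats = torch.randn(2, 196, 256)
    task = torch.randn(2, 64)
    print(moe(feats, task).shape)  # torch.Size([2, 196, 256])
```

The intent of such task-conditioned routing is that observations from different tasks activate different expert subsets, which is one plausible way to reduce the cross-task interference the abstract attributes to a single shared visual encoder.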