Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable. In this paper, we study \textit{scene-induced occlusion} as a fundamental challenge for VLA models and introduce \textbf{LIBERO-Occ}, an occlusion-oriented extension of LIBERO. Experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion. To address this issue, we propose \textbf{Viewpoint Imagination (VIM)}, which generates a complementary view from an occluded primary observation and conditions action prediction on both observed and imagined evidence. VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment time, suggesting that viewpoint imagination is a promising mechanism for perception completion in partially observable manipulation. Our benchmark and corresponding code are available at \href{https://github.com/litsh/Libero-Occ}{https://github.com/litsh/Libero-Occ}.