Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable. In this paper, we study \textit{scene-induced occlusion} as a fundamental challenge for VLA models and introduce \textbf{LIBERO-Occ}, an occlusion-oriented extension of LIBERO. Experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion. To address this issue, we propose \textbf{Viewpoint Imagination (VIM)}, which generates a complementary view from an occluded primary observation and conditions action prediction on both observed and imagined evidence. VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment time, suggesting that viewpoint imagination is a promising mechanism for perception completion in partially observable manipulation. Our benchmark and corresponding code are available at: \href{https://github.com/litsh/Libero-Occ}{https://github.com/litsh/Libero-Occ}.