Reliable simulation-based evaluation of robot manipulation policies can serve as a high-fidelity proxy for real-world performance. Although existing benchmarks cover a wide range of task categories, they lack visual realism, which creates a large domain gap between simulation and reality and undermines how well simulation results predict real-world performance. To mitigate this sim-to-real visual gap, we first conduct a systematic analysis that isolates the effects of lighting and material. Our results show that these factors play a critical role in geometric reasoning and spatial grounding, yet are largely overlooked in existing benchmarks. Motivated by this analysis, we propose VISER, a visually realistic benchmark for evaluating robot manipulation in simulation. VISER features a high-fidelity dataset of over 1,000 3D assets with physically based rendering (PBR) materials, along with 3D scenes built from these assets through either curated layouts or generation. To build the dataset, we propose an automated pipeline that leverages multimodal large language models (MLLMs) for material-aware part segmentation and material retrieval, enabling scalable generation of physically plausible assets. Building on this asset dataset, we construct diverse evaluation tasks, including grasping, placing, and long-horizon tasks, enabling scalable and reproducible assessment of Vision-Language-Action (VLA) models. Our benchmark shows a strong correlation between simulation and real-world performance, achieving an average Pearson correlation coefficient of 0.92 across different policies.
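As an illustration of the reported metric, the following is a minimal sketch of how an average Pearson correlation between simulated and real success rates could be computed. It assumes one coefficient is computed per policy across tasks and the coefficients are then averaged, which is only one possible reading of the abstract. The function names, the policy names, and all numbers are hypothetical placeholders, not results or code from VISER.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_pearson(per_policy):
    """Average the per-policy coefficients (assumed aggregation scheme)."""
    coeffs = [pearson_r(sim, real) for sim, real in per_policy.values()]
    return sum(coeffs) / len(coeffs)


# Hypothetical per-task success rates: (simulation, real world) for each policy.
# These values are placeholders for illustration only.
example = {
    "policy_a": ([0.2, 0.5, 0.8, 0.9], [0.1, 0.4, 0.7, 0.9]),
    "policy_b": ([0.3, 0.6, 0.4, 0.7], [0.2, 0.5, 0.5, 0.6]),
}

if __name__ == "__main__":
    print(f"mean Pearson r = {mean_pearson(example):.3f}")
```

A coefficient near 1 would indicate that tasks on which a policy does better in simulation are also the tasks on which it does better on the real robot; it does not by itself imply that the absolute success rates match.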