Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation

Reliable simulation evaluation of robot manipulation policies serves as a high-fidelity proxy for real-world performance. Although existing benchmarks cover a wide range of task categories, they lack visual realism, creating a large domain gap between simulation and reality. This gap undermines the reliability of simulation-based evaluation in predicting real-world performance. To mitigate the sim-to-real visual gap, we conduct a systematic analysis that isolates the effects of lighting and material. Our results show that these factors play a critical role in geometric reasoning and spatial grounding, yet are largely overlooked in existing benchmarks. Motivated by this analysis, we propose VISER, a visually realistic benchmark for evaluating robot manipulation in simulation. VISER features a high-fidelity dataset of over 1,000 3D assets with physically based rendering (PBR) materials, along with 3D scenes created from these assets through curated layouts or generation. To build this dataset, we propose an automated pipeline that leverages Multi-modal Large Language Models (MLLMs) for material-aware part segmentation and material retrieval, enabling scalable generation of physically plausible assets. Building on the high-fidelity 3D asset dataset, we construct diverse evaluation tasks, such as grasping, placing, and long-horizon tasks, enabling scalable and reproducible assessment of Vision-Language-Action (VLA) models. Our benchmark shows a strong correlation between simulation and real-world performance, achieving an average Pearson correlation coefficient of 0.92 across different policies.
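As a rough illustration of the reported metric, the sketch below computes a Pearson correlation coefficient between simulated and real-world success rates. The abstract does not say how the per-policy values are computed or averaged, so the function name `pearson_r`, the pairing of per-task success rates, and every number in the example are assumptions made for illustration only, not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-task success rates for a single policy
# (illustrative values only, NOT results from the paper).
sim_success = [0.80, 0.55, 0.30, 0.65]
real_success = [0.75, 0.50, 0.35, 0.60]

r = pearson_r(sim_success, real_success)
print(f"Pearson r = {r:.3f}")
```

Under this reading, one coefficient would be computed per policy over its tasks, and the coefficients would then be averaged across policies. A value near 1 would mean that tasks that are easier in simulation are also easier in the real world.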

