The pursuit of robot generalists, agents capable of performing diverse tasks across diverse environments, demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. We introduce RobotArena Infinity, a new benchmarking framework that overcomes these challenges by shifting vision-language-action (VLA) evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated vision-language-model-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons. To measure robustness, we systematically perturb simulated environments along multiple axes, including textures and object placements, stress-testing policy generalization under controlled variation. The result is a continuously evolving, reproducible, and scalable benchmark for real-world-trained robot manipulation policies, addressing a critical missing capability in today's robotics landscape.