THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of studies evaluate robot performance in environments closely resembling or even identical to the training setup. We present THE COLOSSEUM, a novel simulation benchmark with 20 diverse manipulation tasks that enables systematic evaluation of models across 14 axes of environmental perturbations. These perturbations include changes in the color, texture, and size of objects, table-tops, and backgrounds; we also vary lighting, distractors, physical properties, and camera pose. Using THE COLOSSEUM, we compare 5 state-of-the-art manipulation models and find that their success rates degrade by 30-50% across these perturbation factors. When multiple perturbations are applied in unison, the success rate degrades by $\geq$75%. We find that changing the number of distractor objects, the target object color, or the lighting conditions reduces model performance the most. To verify the ecological validity of our results, we show that our results in simulation are correlated ($\bar{R}^2 = 0.614$) with similar perturbations in real-world experiments. We open-source the code for others to use THE COLOSSEUM, and also release code to 3D print the objects used to replicate the real-world perturbations. Ultimately, we hope that THE COLOSSEUM will serve as a benchmark to identify modeling decisions that systematically improve generalization for manipulation. See https://robot-colosseum.github.io/ for more details.
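
The two quantities reported above, the drop in success rate under perturbation and the sim-to-real correlation, can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: the drop is read as relative to the unperturbed baseline, $\bar{R}^2$ is read as the adjusted coefficient of determination of a one-variable least-squares line fit, and all function names and numbers are hypothetical placeholders, not results or code from THE COLOSSEUM.

```python
def relative_drop(baseline, perturbed):
    """Percentage drop in success rate relative to the unperturbed baseline.

    Assumption: degradation is measured relative to the baseline; the
    abstract does not say whether the drop is relative or absolute.
    """
    if baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    return 100.0 * (baseline - perturbed) / baseline


def r_squared(x, y):
    """R^2 of a least-squares line fit y ~ a*x + b.

    For a single predictor with an intercept, this equals the squared
    Pearson correlation between x and y.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def adjusted_r_squared(r2, n, p=1):
    """Adjusted R^2 for n samples and p predictors (assumed reading of R-bar^2)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


if __name__ == "__main__":
    # Hypothetical success rates per perturbation factor (NOT data from the paper).
    sim_success = [0.62, 0.48, 0.35, 0.55, 0.41]
    real_success = [0.58, 0.40, 0.37, 0.50, 0.33]

    print("relative drop (%):", relative_drop(baseline=0.80, perturbed=0.48))
    r2 = r_squared(sim_success, real_success)
    print("R^2:", r2)
    print("adjusted R^2:", adjusted_r_squared(r2, n=len(sim_success)))
```

Under these assumptions, a policy that succeeds 80% of the time without perturbations and 48% of the time with them has lost 40% of its baseline success rate, which would fall inside the reported 30-50% range.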