Synthetica: Large Scale Synthetic Data for Robot Perception

Vision-based object detectors are a crucial basis for robotics applications, as they provide valuable information about object localization in the environment. These detectors need to remain highly reliable under varying lighting conditions, occlusions, and visual artifacts, all while running in real time. Collecting and annotating real-world data for these networks is prohibitively time-consuming and costly, especially for custom assets such as industrial objects, which makes it an untenable route to generalization in in-the-wild scenarios. To this end, we present Synthetica, a method for large-scale synthetic data generation for training robust state estimators. This paper focuses on object detection, an important problem that can serve as the front end for most state estimation problems, such as pose estimation. Leveraging a photorealistic ray-tracing renderer, we scale up data generation to 2.7 million images and use them to train highly accurate real-time detection transformers. We present a collection of rendering randomization and training-time data augmentation techniques conducive to robust sim-to-real performance for vision tasks. We demonstrate state-of-the-art object detection performance with detectors that run at 50-100 Hz, which is 9 times faster than the prior SOTA. We further demonstrate the usefulness of our training methodology for robotics applications by showcasing a pipeline for real-world use with custom objects for which no prior datasets exist. Our work highlights the importance of scaling synthetic data generation for robust sim-to-real transfer while achieving the fastest real-time inference speeds. Videos and supplementary information can be found at https://sites.google.com/view/synthetica-vision.
