Assessing Quality Metrics for Neural Reality Gap Input Mitigation in Autonomous Driving Testing

Simulation-based testing of automated driving systems (ADS) is the industry standard because it offers a controlled, safe, and cost-effective alternative to real-world testing. Despite these advantages, virtual simulations often fail to replicate real-world conditions accurately, particularly in image fidelity, texture representation, and environmental accuracy. This can lead to significant differences in ADS behavior between the simulated and real-world domains, a phenomenon known as the sim2real gap. Researchers have used image-to-image (I2I) neural translation to mitigate the sim2real gap, enhancing the realism of simulated environments by transforming synthetic data into more authentic representations of real-world conditions. While promising, these techniques may introduce artifacts, distortions, or inconsistencies in the generated data that can affect the effectiveness of ADS testing. In our empirical study, we investigated how the quality of I2I techniques influences the mitigation of the sim2real gap, using a set of established metrics from the literature. We evaluated two popular generative I2I architectures, pix2pix and CycleGAN, on two ADS perception tasks at the model level, namely vehicle detection and end-to-end lane keeping, using paired simulated and real-world datasets. Our findings reveal that the effectiveness of I2I architectures varies across ADS tasks and that existing evaluation metrics do not consistently align with ADS behavior. We therefore fine-tuned perception metrics for each task, which yielded a stronger correlation with ADS behavior. Our findings indicate that a perception metric that incorporates semantic elements and is tailored to each task can help select the most appropriate I2I technique for a reliable assessment of sim2real gap mitigation.
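To make the notion of a metric "aligning with ADS behavior" concrete, the following is a minimal sketch of one way such alignment could be quantified: a rank correlation between a per-image quality score and the ADS error on the same image. The choice of Spearman correlation, the function names, and the numbers are all assumptions for illustration; the abstract does not state which correlation statistic or data the study used.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(x), ranks(y))


# Hypothetical values, NOT results from the study:
# a perceptual distance between each translated image and its real
# counterpart, and the lane-keeping steering error on that image.
metric_distance = [0.10, 0.25, 0.40, 0.55, 0.70]
steering_error = [0.02, 0.05, 0.04, 0.09, 0.12]

rho = spearman(metric_distance, steering_error)
```

Under this reading, a coefficient near 1 would mean that images the metric scores as farther from the real domain also produce larger ADS errors, while a coefficient near 0 would mean the metric says little about ADS behavior.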
