Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, is limited in scale, in the diversity of sensor configurations, and in geographic and long-tail behavioral coverage. In contrast, in-the-wild data from sources such as dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured monocular video is incompatible with ADS, which expect structured, multi-modal sensor inputs for training and validation. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite (AV logs) comprising multi-view camera images and LiDAR point clouds. A core challenge is the lack of paired training data. We address this by converting real AV logs into dashcam-style videos via 4D Gaussian Splatting (4DGS) reconstruction and novel-view rendering, so that each rendered video is paired with the AV log it was derived from. Sensor2Sensor then uses a diffusion architecture to perform the generative conversion. We perform comprehensive quantitative evaluations of the fidelity and realism of the generated sensor data. We demonstrate Sensor2Sensor's practical utility by converting challenging in-the-wild internet and dashcam footage into realistic, multi-modal data formats, unlocking vast external data sources for AV development.
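As a rough illustration of the novel-view step behind the paired-data construction, the sketch below projects a 3D scene point into a virtual "dashcam" placed at a different position from the original AV camera. It is not the paper's method: the 4DGS reconstruction and renderer are replaced here by plain pinhole projection of a single point, and the names `PinholeCamera` and `project`, the intrinsics, and the 0.5 m offset are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point3 = Tuple[float, float, float]


@dataclass
class PinholeCamera:
    """Hypothetical virtual camera (assumption, not from the paper).

    Frame convention assumed here: x right, y down, z forward.
    Only a translation offset is modeled; rotation is omitted for brevity.
    """
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int
    offset: Point3 = (0.0, 0.0, 0.0)  # camera position in the vehicle frame


def project(cam: PinholeCamera, point: Point3) -> Optional[Tuple[float, float]]:
    """Project a vehicle-frame point to pixel coordinates, or None if not visible."""
    # Express the point relative to the camera position.
    x = point[0] - cam.offset[0]
    y = point[1] - cam.offset[1]
    z = point[2] - cam.offset[2]
    if z <= 0.0:
        return None  # point is behind the camera
    # Standard pinhole projection.
    u = cam.fx * x / z + cam.cx
    v = cam.fy * y / z + cam.cy
    if not (0.0 <= u < cam.width and 0.0 <= v < cam.height):
        return None  # point falls outside the image
    return (u, v)


# Original AV camera and a virtual dashcam shifted 0.5 m to the right (made-up values).
av_cam = PinholeCamera(fx=1000.0, fy=1000.0, cx=960.0, cy=540.0, width=1920, height=1080)
dashcam = PinholeCamera(fx=1000.0, fy=1000.0, cx=960.0, cy=540.0, width=1920, height=1080,
                        offset=(0.5, 0.0, 0.0))

point_ahead = (0.0, 0.0, 10.0)  # a scene point 10 m in front of the vehicle
print(project(av_cam, point_ahead))
print(project(dashcam, point_ahead))
```

The same scene point lands at a different pixel in the shifted camera, which is the geometric effect a novel-view renderer reproduces for a full reconstructed scene. In a pairing pipeline of the kind the abstract describes, the rendered dashcam-style frames would serve as the model input and the original multi-view images and LiDAR as the target.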