Safety-critical applications, such as autonomous driving, require extensive multimodal data for rigorous testing. Methods based on synthetic data are gaining prominence due to the cost and complexity of gathering real-world data but require a high degree of realism and controllability in order to be useful. This paper introduces MObI, a novel framework for Multimodal Object Inpainting that leverages a diffusion model to create realistic and controllable object inpaintings across perceptual modalities, demonstrated for both camera and lidar simultaneously. Using a single reference RGB image, MObI enables objects to be seamlessly inserted into existing multimodal scenes at a 3D location specified by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, our 3D bounding box conditioning gives objects accurate spatial positioning and realistic scaling. As a result, our approach can be used to insert novel objects flexibly into multimodal scenes, providing significant advantages for testing perception models.