Accurate 3D object detection from LiDAR-camera sensors demands seamless feature alignment and fusion strategies. In this paper, we propose 3DifFusionDet, a framework that formulates 3D object detection as a denoising diffusion process from noisy 3D boxes to target boxes. During training, ground-truth boxes are diffused toward a random distribution, and the model learns to reverse this noising process. During inference, the model progressively refines a set of randomly generated boxes into the final detections. Combined with the feature alignment strategy, this progressive refinement can contribute significantly to robust LiDAR-camera fusion. The iterative refinement also makes the framework adaptable to detection settings that require different trade-offs between accuracy and speed. Extensive experiments on KITTI, a benchmark for real-world traffic object detection, show that 3DifFusionDet performs favorably compared with earlier, well-established detectors.
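The box-level diffusion described above can be illustrated with a minimal sketch. It is not the paper's implementation: the cosine noise schedule, the 7-parameter box encoding (x, y, z, l, w, h, yaw), the DDIM-style deterministic update, and the names `cosine_alpha_bar`, `noise_boxes`, `refine_boxes`, and `predict_x0` are all assumptions made for illustration. In a real detector, `predict_x0` would be the learned network conditioned on fused LiDAR-camera features.

```python
import numpy as np

def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal rate alpha_bar[t] for t = 0..num_steps (assumed cosine schedule)."""
    t = np.linspace(0.0, 1.0, num_steps + 1)
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]  # alpha_bar[0] == 1 (clean boxes), alpha_bar[-1] ~ 0 (pure noise)

def noise_boxes(gt_boxes, t, alpha_bar, rng):
    """Training-time forward process: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps."""
    eps = rng.standard_normal(gt_boxes.shape)
    return np.sqrt(alpha_bar[t]) * gt_boxes + np.sqrt(1.0 - alpha_bar[t]) * eps

def refine_boxes(predict_x0, num_boxes, box_dim, alpha_bar, steps, rng):
    """Inference: start from random boxes and refine them along a descending
    list of timesteps that ends at 0 (deterministic DDIM-style update)."""
    x = rng.standard_normal((num_boxes, box_dim))
    for t, t_prev in zip(steps[:-1], steps[1:]):
        x0 = predict_x0(x, t)  # hypothetical denoiser: estimate of the clean boxes
        # Noise implied by the current boxes and the clean-box estimate.
        eps = (x - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])
        # Re-noise the estimate to the lower noise level t_prev.
        x = np.sqrt(alpha_bar[t_prev]) * x0 + np.sqrt(1.0 - alpha_bar[t_prev]) * eps
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = 1000
    alpha_bar = cosine_alpha_bar(T)
    gt = rng.uniform(-1.0, 1.0, size=(5, 7))  # 5 normalized boxes, 7 parameters each
    noisy = noise_boxes(gt, 500, alpha_bar, rng)
    # Stand-in for a trained network: an oracle that always returns the ground truth.
    oracle = lambda x, t: gt
    steps = [int(v) for v in np.linspace(T, 0, 5)]  # 4 refinement steps
    refined = refine_boxes(oracle, 5, 7, alpha_bar, steps, rng)
    print(noisy.shape, refined.shape)
```

In this sketch the length of `steps` is the knob that trades accuracy for speed: fewer entries mean fewer calls to the denoiser, which is one plausible reading of the adaptability claimed above.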