An important paradigm in 3D object detection is the use of multiple modalities to enhance accuracy in both normal and challenging conditions, particularly for long-tail scenarios. To address these scenarios, recent studies have explored two directions of adaptive approaches: MoE-based adaptive fusion, which struggles with uncertainties arising from distinct object configurations, and late fusion for output-level adaptive fusion, which relies on separate detection pipelines and thus limits comprehensive scene understanding. In this work, we introduce Cocoon, an object- and feature-level uncertainty-aware fusion framework. The key innovation lies in uncertainty quantification for heterogeneous representations, enabling fair comparison across modalities through the introduction of a feature aligner and a learnable surrogate ground truth, termed feature impression. We also define a training objective that ensures their relationship provides a valid metric for uncertainty quantification. Cocoon consistently outperforms existing static and adaptive methods in both normal and challenging conditions, including those with natural and artificial corruptions. Furthermore, we demonstrate the validity and efficacy of our uncertainty metric across diverse datasets.