Monocular 3D object detection typically relies on pseudo-labeling techniques to reduce dependency on real-world annotations. Recent advances demonstrate that deterministic linguistic cues can serve as effective auxiliary weak supervision signals, providing complementary semantic context. However, hand-crafted textual descriptions struggle to capture the inherent visual diversity of individual instances across scenes, limiting the model's ability to learn scene-aware representations. To address this challenge, we propose Visual-referred Probabilistic Prompt Learning (VirPro), an adaptive multi-modal pretraining paradigm that can be seamlessly integrated into diverse weakly supervised monocular 3D detection frameworks. Specifically, we generate a diverse set of learnable, instance-conditioned prompts across scenes and store them in an Adaptive Prompt Bank (APB). Subsequently, we introduce Multi-Gaussian Prompt Modeling (MGPM), which incorporates scene-based visual features into the corresponding textual embeddings, allowing the text prompts to express visual uncertainty. From the fused vision-language embeddings, we then decode a prompt-targeted Gaussian, from which we derive a unified object-level prompt embedding for each instance. RoI-level contrastive matching enforces modality alignment, pulling the embeddings of co-occurring objects within the same scene closer in the latent space and thus enhancing semantic coherence. Extensive experiments on the KITTI benchmark demonstrate that integrating our pretraining paradigm consistently yields substantial performance gains, achieving an average precision improvement of up to 4.8% over the baseline.
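To make the pipeline above concrete, the following is a minimal PyTorch-style sketch of the two core components described in the abstract: MGPM-style fusion of instance visual features into bank prompts with a decoded Gaussian embedding, and RoI-level contrastive matching. It is not the authors' implementation; all module names, dimensions, the cross-attention fusion, and the single diagonal-Gaussian simplification of MGPM are illustrative assumptions.

```python
# Minimal illustrative sketch (NOT the authors' code). Assumes standard
# PyTorch; names, dimensions, and the single-Gaussian simplification of
# Multi-Gaussian Prompt Modeling are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGaussianPromptModeling(nn.Module):
    """Fuses per-instance visual features into learnable bank prompts and
    decodes a Gaussian (mean + diagonal log-variance) over the prompt
    embedding, so the text side can express visual uncertainty."""
    def __init__(self, dim=512, bank_size=32):
        super().__init__()
        # Adaptive Prompt Bank: learnable prompt vectors shared across scenes.
        self.prompt_bank = nn.Parameter(torch.randn(bank_size, dim) * 0.02)
        # Cross-attention used here as one plausible fusion mechanism.
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)

    def forward(self, roi_feats):
        # roi_feats: (N, dim) visual features of N detected instances.
        N, _ = roi_feats.shape
        prompts = self.prompt_bank.unsqueeze(0).expand(N, -1, -1)  # (N, K, dim)
        # Condition prompts on each instance's visual feature.
        fused, _ = self.fuse(prompts,
                             roi_feats.unsqueeze(1),
                             roi_feats.unsqueeze(1))                # (N, K, dim)
        pooled = fused.mean(dim=1)                                  # (N, dim)
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        # Reparameterized sample: the unified object-level prompt embedding.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z, mu, logvar

def roi_contrastive_loss(vis_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over co-occurring objects in one scene: each
    object's visual embedding should match its own prompt embedding."""
    v = F.normalize(vis_emb, dim=-1)
    t = F.normalize(txt_emb, dim=-1)
    logits = v @ t.T / temperature                      # (N, N) similarities
    targets = torch.arange(v.size(0), device=v.device)  # diagonal positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage: six co-occurring objects in one scene.
mgpm = MultiGaussianPromptModeling()
vis = torch.randn(6, 512)
z, mu, logvar = mgpm(vis)
loss = roi_contrastive_loss(vis, z)
```

In this sketch the reparameterized sample z plays the role of the unified object-level prompt embedding derived from the decoded Gaussian, and the symmetric InfoNCE term pulls matched visual-text pairs together while pushing apart mismatched objects in the same scene.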