In this work, we address two limitations of current LiDAR-based 3D object detection systems: a restricted class vocabulary and the high cost of annotating new object classes. We explore open-vocabulary (OV) learning in urban environments, aiming to capture novel instances using pre-trained vision-language models (VLMs) and multi-sensor data. We design and benchmark four candidate solutions as baselines, categorizing each as top-down or bottom-up according to its input data strategy. While effective, these methods have notable limitations: they either miss novel objects during 3D box estimation or apply strict priors that bias detection towards objects near the camera or with rectangular geometry. To overcome these limitations, we introduce a universal \textsc{Find n' Propagate} approach for 3D OV tasks, which aims to maximize the recall of novel objects and to propagate this detection capability to more distant areas, thereby progressively capturing more of them. In particular, we use a greedy box seeker to search for novel 3D boxes of varying orientation and depth within each generated frustum, and we ensure the reliability of newly identified boxes through cross alignment and a density ranker. Additionally, the inherent bias towards camera-proximal objects is alleviated by the proposed remote simulator, which randomly diversifies pseudo-labeled novel instances during self-training, combined with the fusion of base samples in the memory bank. Extensive experiments demonstrate a 53% improvement in novel recall across diverse OV settings, VLMs, and 3D detectors. Notably, we achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes. The source code is made available in the supplementary material.
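To make the idea of searching a frustum for boxes of varying depth and orientation more concrete, the following is a minimal bird's-eye-view sketch. It is not the authors' implementation: the function names (`points_in_box`, `seek_box`), the fixed class size prior, the restriction of candidates to the frustum's central ray, and the use of a raw point count as the density score are all assumptions made for illustration, and the cross-alignment step is omitted entirely.

```python
import math


def points_in_box(points, cx, cy, yaw, length, width):
    """Count BEV points (x, y) that fall inside a rotated rectangle."""
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    count = 0
    for x, y in points:
        dx, dy = x - cx, y - cy
        # Rotate the offset into the box's local frame.
        local_x = dx * cos_y + dy * sin_y
        local_y = -dx * sin_y + dy * cos_y
        if abs(local_x) <= length / 2 and abs(local_y) <= width / 2:
            count += 1
    return count


def seek_box(points, azimuth_min, azimuth_max, size, depths, yaws):
    """Score candidate boxes inside one frustum and keep the densest.

    Hypothetical simplification: the frustum is an azimuth interval seen
    from a sensor at the origin, and candidate centers lie on its
    central ray at the given depths.
    """
    length, width = size
    azimuth = 0.5 * (azimuth_min + azimuth_max)
    # Keep only points whose bearing falls inside the frustum.
    in_frustum = [
        (x, y) for x, y in points
        if azimuth_min <= math.atan2(y, x) <= azimuth_max
    ]
    best = None
    for depth in depths:
        cx = depth * math.cos(azimuth)
        cy = depth * math.sin(azimuth)
        for yaw in yaws:
            score = points_in_box(in_frustum, cx, cy, yaw, length, width)
            # Keep the first candidate with the highest point count.
            if best is None or score > best["score"]:
                best = {"center": (cx, cy), "yaw": yaw, "score": score}
    return best
```

In this sketch the point count stands in for a density ranking: a candidate that encloses more LiDAR points is preferred, so a box whose depth and heading match an elongated cluster outscores one that is misplaced or rotated away from it. A practical system would also need to handle sparse distant objects, which is the bias the abstract's remote simulator is meant to address.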