In this work, we tackle the limitations of current LiDAR-based 3D object detection systems, which are hindered by a restricted class vocabulary and the high cost of annotating new object classes. We explore open-vocabulary (OV) learning in urban environments, aiming to capture novel instances using pre-trained vision-language models (VLMs) with multi-sensor data. We design and benchmark four potential solutions as baselines, categorized as either top-down or bottom-up approaches according to their input data strategies. While effective, these methods exhibit certain limitations, such as missing novel objects during 3D box estimation or imposing rigid priors that bias detection towards objects near the camera or with rectangular geometries. To overcome these limitations, we introduce a universal \textsc{Find n' Propagate} approach for 3D OV tasks, which maximizes the recall of novel objects and propagates this detection capability to more distant areas, thereby progressively capturing more instances. In particular, we employ a greedy box seeker to search for novel 3D boxes of varying orientations and depths within each generated frustum, and we ensure the reliability of newly identified boxes with a cross alignment and density ranker. Additionally, the inherent bias towards camera-proximal objects is alleviated by the proposed remote simulator, which randomly diversifies pseudo-labeled novel instances during self-training, combined with the fusion of base samples from a memory bank. Extensive experiments demonstrate a 53% improvement in novel-object recall across diverse OV settings, VLMs, and 3D detectors. Notably, we achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes. The source code is available at https://github.com/djamahl99/findnpropagate.