Large language models have enabled systems such as ChatGPT to achieve striking success on many tasks, demonstrating the power of foundation models. To bring that capability to vision tasks, the Segment Anything Model (SAM), a vision foundation model for image segmentation, was recently proposed and shows strong zero-shot ability on many downstream 2D tasks. However, whether SAM can be adapted to 3D vision tasks, and to 3D object detection in particular, has yet to be explored. In this paper, we explore adapting SAM's zero-shot ability to 3D object detection. We propose a SAM-powered bird's-eye-view (BEV) processing pipeline for detecting objects and obtain promising results on the large-scale Waymo Open Dataset. As an early attempt, our method takes a step toward 3D object detection with vision foundation models and points to the opportunity to apply them to 3D vision tasks. The code is released at https://github.com/DYZhang09/SAM3D.
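To make the idea of a BEV processing pipeline concrete, the following is a minimal sketch, not the released implementation. It assumes LiDAR points are rasterized into a top-down max-height grid and that a 2D segmenter such as SAM then returns binary masks over that grid. The function names `points_to_bev` and `mask_to_bev_box`, the ranges, and the cell size are all hypothetical choices made for illustration, and the segmentation step itself is omitted.

```python
import numpy as np


def points_to_bev(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.5):
    """Rasterize LiDAR points of shape (N, 3+) into a BEV max-height grid.

    Hypothetical helper: ranges and cell size are illustrative assumptions.
    Returns the height grid (-inf where empty) and a boolean occupancy grid.
    """
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    # Drop points that fall outside the chosen BEV window.
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y, z = x[keep], y[keep], z[keep]

    # Convert metric coordinates to integer cell indices.
    ix = np.clip(np.floor((x - x_range[0]) / cell).astype(int), 0, nx - 1)
    iy = np.clip(np.floor((y - y_range[0]) / cell).astype(int), 0, ny - 1)

    # Keep the highest point per cell.
    bev = np.full((nx, ny), -np.inf)
    np.maximum.at(bev, (ix, iy), z)
    return bev, np.isfinite(bev)


def mask_to_bev_box(mask, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.5):
    """Convert one binary BEV mask into an axis-aligned box in metres.

    Hypothetical helper: returns (x_min, y_min, x_max, y_max), or None if
    the mask is empty. A real detector would also estimate heading and height.
    """
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    x_min = x_range[0] + rows.min() * cell
    x_max = x_range[0] + (rows.max() + 1) * cell
    y_min = y_range[0] + cols.min() * cell
    y_max = y_range[0] + (cols.max() + 1) * cell
    return (x_min, y_min, x_max, y_max)
```

In such a sketch, the occupancy or height grid would be rendered as an image and passed to the segmenter, and each returned mask would be turned into a box with `mask_to_bev_box`. The mapping between grid cells and metric coordinates is the only piece shown here.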