Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL-E 2, and Stable Diffusion. However, the connection between text and other visual modalities, such as lidar data, has received less attention, hindered by the lack of text-lidar datasets. In this work, we propose LidarCLIP, a mapping from automotive point clouds to a pre-existing CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder with the image CLIP embeddings, effectively relating text and lidar data with the image domain as an intermediary. We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is generally on par with image-based retrieval, but with complementary strengths and weaknesses. By combining image and lidar features, we improve upon both single-modality methods and enable a targeted search for challenging detection scenarios under adverse sensor conditions. We also explore zero-shot classification and show that LidarCLIP outperforms existing attempts to use CLIP for point clouds by a large margin. Finally, we leverage our compatibility with CLIP to explore a range of applications, such as point cloud captioning and lidar-to-image generation, without any additional training. Code and pre-trained models are available at https://github.com/atonderski/lidarclip.
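The supervision scheme described above, in which a point cloud encoder is trained to reproduce the CLIP embedding of the paired image, can be illustrated with a minimal sketch. The abstract does not specify the training loss or the rule for combining image and lidar features, so the cosine-similarity objective and the sum-of-normalized-embeddings fusion below are assumptions made for illustration only, and all function names (`cosine_distillation_loss`, `retrieve`, `fuse`) are hypothetical rather than taken from the released code. The embeddings are plain NumPy arrays standing in for encoder outputs.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each embedding (last axis) to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cosine_distillation_loss(lidar_emb, image_emb):
    """Mean (1 - cosine similarity) over paired lidar/image embeddings.

    Assumed objective: the image embeddings come from a frozen CLIP image
    encoder and act as targets for the point cloud encoder.
    """
    lidar_n = l2_normalize(lidar_emb)
    image_n = l2_normalize(image_emb)
    return float(np.mean(1.0 - np.sum(lidar_n * image_n, axis=-1)))


def fuse(image_embs, lidar_embs):
    """Assumed fusion rule: sum the normalized embeddings, then renormalize."""
    return l2_normalize(l2_normalize(image_embs) + l2_normalize(lidar_embs))


def retrieve(query_emb, scene_embs, top_k=3):
    """Rank scenes by cosine similarity to a query (e.g. a text) embedding."""
    scores = l2_normalize(scene_embs) @ l2_normalize(query_emb)
    order = np.argsort(-scores)[:top_k]
    return order, scores
```

Because the lidar embeddings live in the same space as the CLIP image and text embeddings, a text query embedding can be passed to `retrieve` against lidar embeddings, image embeddings, or the output of `fuse` without any change to the retrieval code.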