LidarCLIP or: How I Learned to Talk to Point Clouds

Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL-E 2, and Stable Diffusion. However, the connection between text and other visual modalities, such as lidar data, has received less attention, held back by the lack of text-lidar datasets. In this work, we propose LidarCLIP, a mapping from automotive point clouds to a pre-existing CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder with the image CLIP embeddings, effectively relating text and lidar data with the image domain as an intermediary. We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is generally on par with image-based retrieval, but with complementary strengths and weaknesses. By combining image and lidar features, we improve upon both single-modality methods and enable a targeted search for challenging detection scenarios under adverse sensor conditions. We also explore zero-shot classification and show that LidarCLIP outperforms existing attempts to use CLIP for point clouds by a large margin. Finally, we leverage our compatibility with CLIP to explore a range of applications, such as point cloud captioning and lidar-to-image generation, without any additional training. Code and pre-trained models are available at https://github.com/atonderski/lidarclip.
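
The supervision scheme and the retrieval setup described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the cosine-similarity loss, the averaging used to fuse features, and all function names (`distillation_loss`, `retrieve`, `joint_features`) are assumptions made for illustration, and random arrays stand in for real encoder outputs.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def distillation_loss(lidar_emb, image_emb):
    """Mean (1 - cosine similarity) between paired lidar and image embeddings.

    image_emb stands in for the output of a frozen CLIP image encoder (teacher).
    In training, only the lidar encoder that produced lidar_emb would be updated.
    The choice of a cosine loss is an assumption, not taken from the abstract.
    """
    cos = np.sum(l2_normalize(lidar_emb) * l2_normalize(image_emb), axis=-1)
    return float(np.mean(1.0 - cos))

def retrieve(text_emb, scene_emb, k=3):
    """Indices of the k scenes whose embeddings best match one text embedding."""
    scores = l2_normalize(scene_emb) @ l2_normalize(text_emb)
    return np.argsort(-scores)[:k]

def joint_features(lidar_emb, image_emb):
    """One simple fusion choice (assumed): average the normalized features."""
    return l2_normalize(l2_normalize(lidar_emb) + l2_normalize(image_emb))

# Toy data standing in for real encoder outputs (5 scenes, 8-dim embeddings).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(5, 8))
lidar_emb = image_emb + 0.1 * rng.normal(size=(5, 8))  # a nearly converged student
text_emb = image_emb[2]                                 # a query aligned with scene 2

print("distillation loss:", distillation_loss(lidar_emb, image_emb))
print("top match (image only):", retrieve(text_emb, image_emb, k=1))
print("top match (joint):", retrieve(text_emb, joint_features(lidar_emb, image_emb), k=1))
```

Because the lidar encoder is trained to land in the same space as the image embeddings, a text embedding from the unchanged CLIP text encoder can be compared against lidar, image, or fused features with the same dot product.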

Point cloud

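A point cloud obtained by laser measurement contains three-dimensional coordinates (XYZ) and laser reflection intensity (Intensity). A point cloud obtained by photogrammetry contains three-dimensional coordinates (XYZ) and color information (RGB). A point cloud obtained by combining laser measurement and photogrammetry contains three-dimensional coordinates (XYZ), laser reflection intensity (Intensity), and color information (RGB). Once the spatial coordinates of every sampled point on an object's surface have been acquired, the result is a set of points, which is called a "point cloud".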
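
As a rough illustration of these layouts, a point cloud can be held as a 2D array with one row per point. The values below are made up, and the column order and the 20 m range filter are assumptions chosen only for the example.

```python
import numpy as np

# Hypothetical lidar returns: one row per point, columns x, y, z (metres), intensity.
lidar_points = np.array([
    [12.4, -3.1, 0.2, 0.31],
    [12.5, -3.0, 0.9, 0.28],
    [40.2,  8.7, 1.5, 0.05],
], dtype=np.float32)

# Hypothetical per-point colours (RGB in 0..255), e.g. sampled from a camera image.
colors = np.array([[200, 30, 30], [198, 32, 29], [90, 90, 95]], dtype=np.float32)

# Fused cloud: XYZ + intensity + RGB gives seven values per point.
fused = np.concatenate([lidar_points, colors / 255.0], axis=1)

xyz = fused[:, :3]
ranges = np.linalg.norm(xyz, axis=1)  # distance of each point from the sensor origin
near = fused[ranges < 20.0]           # keep only points within 20 m

print(fused.shape, near.shape)
```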
