Recently, multi-modality models have been introduced to exploit the complementary information from different sensors such as LiDAR and cameras. Such models require paired data along with precise calibration for all modalities; the complicated calibration among modalities greatly increases the cost of collecting such high-quality datasets and hinders these models from being applied in practical scenarios. Building on previous work, we not only fuse information from multiple modalities without the above issues but also fully exploit the information in the RGB modality. We introduce 2D Detection Annotations Transmittable Aggregation (\textbf{2DDATA}), which adds a data-specific branch, called the \textbf{Local Object Branch}, that processes the points inside a given bounding box, motivated by the ease of acquiring 2D bounding box annotations. We demonstrate that this simple design can transmit bounding box prior information to the 3D encoder model, demonstrating the feasibility of large multi-modality models fused with modality-specific data.
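As a rough illustration of what "points in a certain bounding box" could mean in practice, the following minimal sketch selects the points whose image-plane coordinates fall inside a 2D box. It is not the authors' implementation: the function name `points_in_box`, the `(x_min, y_min, x_max, y_max)` box layout, and the assumption that each point already comes with pixel coordinates `(u, v)` are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def points_in_box(pixels: Sequence[Tuple[float, float]], box: Box) -> List[int]:
    """Return indices of points whose (u, v) image coordinates lie inside `box`.

    Assumption: `pixels[i]` is the image-plane location associated with
    point i. How that association is obtained is outside this sketch.
    """
    x_min, y_min, x_max, y_max = box
    return [
        i
        for i, (u, v) in enumerate(pixels)
        if x_min <= u <= x_max and y_min <= v <= y_max  # inclusive bounds
    ]


# Four points with made-up pixel coordinates and one 2D detection box.
pixels = [(10.0, 20.0), (55.0, 40.0), (300.0, 120.0), (60.0, 45.0)]
box = (50.0, 30.0, 100.0, 80.0)

selected = points_in_box(pixels, box)
print(selected)  # [1, 3]
```

The returned indices identify the subset of points that a box-specific branch would receive; the features computed on that subset and the way they are aggregated back into the 3D encoder are not specified in the text above.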