Sensor fusion is an essential topic in many perception systems, such as autonomous driving and robotics. According to dataset leaderboards, pairing a transformer-based detection head with a CNN-based feature encoder that extracts features from raw sensor data has emerged as one of the best-performing frameworks for sensor-fusion 3D detection. In this work, we provide an in-depth survey of the recent literature on transformer-based 3D object detection, focusing primarily on sensor fusion. We also briefly review the basics of Vision Transformers (ViT) so that readers can easily follow the paper. In addition, we briefly cover a few of the less dominant, non-transformer-based methods for sensor fusion in autonomous driving. We conclude by summarizing sensor-fusion trends to follow, with the aim of provoking future research. A more up-to-date summary can be found at: https://github.com/ApoorvRoboticist/Transformers-Sensor-Fusion