This survey explores the adaptation of Vision Transformer models to Autonomous Driving, a transition inspired by their success in Natural Language Processing. Transformers are gaining traction in computer vision: they surpass traditional Recurrent Neural Networks in tasks such as sequential image processing and outperform Convolutional Neural Networks in capturing global context, as evidenced in complex scene recognition. These capabilities are crucial for the real-time, dynamic visual scene processing that Autonomous Driving demands. Our survey provides a comprehensive overview of Vision Transformer applications in Autonomous Driving, focusing on foundational concepts such as self-attention, multi-head attention, and the encoder-decoder architecture. We cover applications in object detection, segmentation, pedestrian detection, lane detection, and more, comparing their architectural merits and limitations. The survey concludes with future research directions, highlighting the growing role of Vision Transformers in Autonomous Driving.
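To make the foundational concept concrete, the following is a minimal NumPy sketch of the scaled dot-product self-attention that underlies the Vision Transformer architectures surveyed here. The function name, token count, and model width are illustrative, not drawn from any particular paper: each "token" stands in for an image patch embedding, and every token attends to every other, which is the global-context capability the abstract contrasts with the local receptive fields of Convolutional Neural Networks.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a token sequence.

    X: (n_tokens, d_model) patch embeddings.
    Wq, Wk, Wv: learned projection matrices (here random, for illustration).
    Returns the attended output and the attention-weight matrix.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise token affinities
    weights = softmax(scores, axis=-1)  # each row is a distribution over tokens
    return weights @ V, weights

# Toy example: 4 image-patch tokens with model width 8.
rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

Multi-head attention, as covered in the survey, simply runs several such projections in parallel on lower-dimensional slices and concatenates the results, letting different heads specialize in different spatial relations.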