Transformer networks have recently outperformed traditional deep neural networks in natural language processing and show great potential in many computer vision tasks compared with convolutional backbones. In the original transformer, readout tokens serve as designated vectors that aggregate information from the other tokens. However, the benefit of readout tokens in a vision transformer is limited. We therefore propose a novel fusion strategy that integrates radar data into a dense prediction transformer network by reassembling camera representations with radar representations. In place of readout tokens, the radar representations contribute additional depth information to a monocular depth estimation model and improve its performance. We further investigate different fusion approaches that are commonly used for integrating an additional modality into a dense prediction transformer network. The experiments are conducted on the nuScenes dataset, which includes camera images, lidar, and radar data. The results show that our proposed method performs better than the commonly used fusion strategies and outperforms existing convolutional depth estimation models that fuse camera images and radar.
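To make the idea of reassembling camera representations with radar representations more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the fusion operator, so the sketch assumes one plausible choice: each camera patch token is concatenated with a radar token for the same patch, projected back to the camera feature width by a linear layer, and the token sequence is reshaped into a 2D feature map. The function names, the toy dimensions, and the random weights are all hypothetical.

```python
import random


def linear(vec, weight, bias):
    """Apply y = W x + b to one feature vector (weight is out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def reassemble_with_radar(camera_tokens, radar_tokens, weight, bias, grid_h, grid_w):
    """Fuse per-patch camera and radar tokens, then restore the 2D patch grid.

    Assumed fusion (not taken from the paper): concatenate the camera token
    and the radar token of the same patch, then project the result back to
    the camera feature width. The output has shape (grid_h, grid_w, out_dim).
    """
    assert len(camera_tokens) == len(radar_tokens) == grid_h * grid_w
    fused = [linear(c + r, weight, bias) for c, r in zip(camera_tokens, radar_tokens)]
    # Reshape the flat token sequence into rows of the patch grid.
    return [fused[row * grid_w:(row + 1) * grid_w] for row in range(grid_h)]


# Toy sizes chosen only for illustration.
rng = random.Random(0)
grid_h, grid_w, cam_dim, radar_dim = 2, 3, 4, 2
num_patches = grid_h * grid_w

camera_tokens = [[rng.gauss(0, 1) for _ in range(cam_dim)] for _ in range(num_patches)]
radar_tokens = [[rng.gauss(0, 1) for _ in range(radar_dim)] for _ in range(num_patches)]

# Random projection weights stand in for learned parameters.
weight = [[rng.gauss(0, 0.1) for _ in range(cam_dim + radar_dim)] for _ in range(cam_dim)]
bias = [0.0] * cam_dim

feature_map = reassemble_with_radar(
    camera_tokens, radar_tokens, weight, bias, grid_h, grid_w
)
print(len(feature_map), len(feature_map[0]), len(feature_map[0][0]))
```

In this sketch the radar tokens occupy the role that a readout token would otherwise play in the reassembly step, with the difference that they carry a separate value per patch rather than one global vector. In a real model the projection would be learned and the result passed on to a convolutional decoder.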