Vision Transformers (ViTs) have achieved state-of-the-art results on a range of computer vision tasks, including 3D object detection. However, their end-to-end design also makes them less explainable, which is a challenge for deployment in safety-critical applications such as autonomous driving, where authorities, developers, and users need to understand the reasoning behind a model's predictions. In this paper, we propose a novel method for generating saliency maps for a DETR-like ViT that takes multiple camera inputs for 3D object detection. Our method is based on raw attention and is more efficient than gradient-based methods. We evaluate it on the nuScenes dataset using extensive perturbation tests and show that it outperforms other explainability methods in both visual quality and quantitative metrics. We also demonstrate the importance of aggregating attention across the different layers of the transformer. Our work contributes to explainable AI for ViTs, which can help increase trust in AI applications by making the inner workings of AI models more transparent.
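To make the idea of aggregating raw attention across layers concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not state the aggregation rule, so averaging over heads and reducing over layers with a mean or a max are assumptions, as are the function name `aggregate_attention`, its parameters, and the token layout (camera-major, then row, then column). It uses only the Python standard library to stay dependency-free.

```python
def aggregate_attention(attn, num_cams, feat_h, feat_w, layer_reduce="mean"):
    """Fuse one object query's raw attention into per-camera saliency maps.

    attn[layer][head] is a flat list of attention weights from a single
    object query to all image tokens (hypothetical layout: camera-major,
    then row, then column). Returns maps[cam][row][col] scaled to [0, 1].
    """
    tokens = num_cams * feat_h * feat_w

    # Assumption: heads are averaged within each layer.
    per_layer = []
    for layer in attn:
        for head in layer:
            if len(head) != tokens:
                raise ValueError("each head must hold num_cams*feat_h*feat_w weights")
        per_layer.append(
            [sum(head[t] for head in layer) / len(layer) for t in range(tokens)]
        )

    # Assumption: layers are combined with a simple mean or max.
    if layer_reduce == "mean":
        fused = [sum(col) / len(per_layer) for col in zip(*per_layer)]
    elif layer_reduce == "max":
        fused = [max(col) for col in zip(*per_layer)]
    else:
        raise ValueError("layer_reduce must be 'mean' or 'max'")

    # Min-max normalise so maps from different queries are comparable.
    lo, hi = min(fused), max(fused)
    scale = hi - lo
    norm = [(v - lo) / scale if scale > 0 else 0.0 for v in fused]

    # Split the flat token axis back into one 2D map per camera.
    maps = []
    for cam in range(num_cams):
        rows = []
        for row in range(feat_h):
            start = (cam * feat_h + row) * feat_w
            rows.append(norm[start:start + feat_w])
        maps.append(rows)
    return maps


if __name__ == "__main__":
    # Two layers, one head each, two cameras with a 1x2 feature map.
    toy = [[[0.0, 0.5, 0.5, 0.0]], [[1.0, 0.0, 0.0, 0.0]]]
    print(aggregate_attention(toy, num_cams=2, feat_h=1, feat_w=2))
```

In a real model the weights would come from the decoder's cross-attention modules, and the resulting low-resolution maps would be upsampled to each camera image before overlaying. Both steps are omitted here.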