Understanding the mechanisms underlying deep neural networks in computer vision remains a fundamental challenge. While many prior approaches have focused on visualizing intermediate representations within deep neural networks, particularly convolutional neural networks, these techniques have yet to be thoroughly explored in transformer-based vision models. In this study, we train inverse models to reconstruct input images from intermediate representations within a Detection Transformer, showing that this approach is both feasible and efficient for transformer-based vision models. Through qualitative and quantitative evaluations of reconstructed images across model stages, we demonstrate critical properties of Detection Transformers, including contextual shape preservation, inter-layer correlation, and robustness to color perturbations, and illustrate how these characteristics emerge within the model's architecture. Our findings contribute to a deeper understanding of transformer-based vision models. The code for reproducing our experiments will be made available at github.com/wiskott-lab/inverse-detection-transformer.
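The core idea of training an inverse model, reconstructing inputs from a frozen network's intermediate features, can be illustrated with a minimal sketch. This is not the paper's implementation: it stands in a frozen random nonlinear projection for a Detection Transformer stage and fits a closed-form linear decoder, purely to show the structure of the inversion objective.

```python
import numpy as np

# Minimal sketch of inverse-model training (illustrative assumptions, not
# the paper's code): a frozen random "encoder" plays the role of an
# intermediate model stage; a ridge-regression decoder D is fit to map
# features F back to inputs X, minimizing ||X - F D||^2 + lam ||D||^2.
rng = np.random.default_rng(0)

d_in, d_feat, n = 32, 64, 500              # input dim, feature dim, samples
X = rng.normal(size=(n, d_in))             # stand-in for input images
W = rng.normal(size=(d_in, d_feat)) * 0.1  # frozen "encoder" weights
F = np.tanh(X @ W)                         # frozen intermediate features

# Train the inverse model: closed-form ridge regression decoder.
lam = 1e-3
D = np.linalg.solve(F.T @ F + lam * np.eye(d_feat), F.T @ X)

X_hat = F @ D                              # reconstructed inputs
err = np.mean((X - X_hat) ** 2) / np.mean(X ** 2)
print(f"relative reconstruction error: {err:.4f}")
```

In practice the decoder would be a learned neural network trained by gradient descent on reconstruction loss, and the frozen encoder would be the actual transformer stage; the linear closed-form solution here only serves to make the inversion objective concrete.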