Understanding the mechanisms underlying deep neural networks in computer vision remains a fundamental challenge. While many previous approaches have visualized intermediate representations of deep neural networks, particularly convolutional neural networks, these techniques have yet to be thoroughly explored in transformer-based vision models. In this study, we apply a modular approach: we train inverse models to reconstruct input images from intermediate layers of a Detection Transformer and a Vision Transformer, and show that this approach is both efficient and feasible. Through qualitative and quantitative evaluations of the reconstructed images, we gain insight into the underlying mechanisms of these architectures, highlighting their similarities and differences in how they preserve contextual shape and image detail, in their inter-layer correlations, and in their robustness to color perturbations. Our analysis illustrates how these properties emerge within the models, contributing to a deeper understanding of transformer-based vision models. The code for reproducing our experiments is available at github.com/wiskott-lab/inverse-tvm.
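To make the approach concrete, the following is a minimal sketch, not the authors' implementation, of training an inverse model on a single intermediate layer of a frozen, pretrained Vision Transformer. The model choice (torchvision's vit_b_16), the layer index, the convolutional decoder architecture, and the MSE reconstruction loss are all illustrative assumptions; the paper's actual setup may differ.

```python
# Sketch: train an inverse model to reconstruct input images from an
# intermediate layer of a frozen pretrained ViT. Hypothetical choices
# throughout: layer index, decoder design, and loss.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen pretrained ViT; we only read out its intermediate activations.
vit = vit_b_16(weights=ViT_B_16_Weights.DEFAULT).to(device).eval()
for p in vit.parameters():
    p.requires_grad_(False)

LAYER = 6  # hypothetical choice of encoder block to invert
feats = {}

def hook(module, inp, out):
    feats["h"] = out  # (B, 197, 768): class token + 14x14 patch tokens

vit.encoder.layers[LAYER].register_forward_hook(hook)

class InverseModel(nn.Module):
    """Maps patch tokens back to pixel space via transposed convolutions."""
    def __init__(self, dim=768):
        super().__init__()
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(dim, 256, 4, stride=4),  # 14 -> 56
            nn.ReLU(),
            nn.ConvTranspose2d(256, 64, 2, stride=2),   # 56 -> 112
            nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 2, stride=2),     # 112 -> 224
        )

    def forward(self, h):
        tokens = h[:, 1:, :]                 # drop the class token
        b, n, d = tokens.shape
        side = int(n ** 0.5)                 # 14 for 224x224 inputs
        grid = tokens.transpose(1, 2).reshape(b, d, side, side)
        return self.decode(grid)

inverse = InverseModel().to(device)
opt = torch.optim.Adam(inverse.parameters(), lr=1e-4)

def train_step(images):
    """One step: forward through the frozen ViT, reconstruct, MSE loss."""
    images = images.to(device)
    with torch.no_grad():
        vit(images)                          # fills feats["h"] via the hook
    recon = inverse(feats["h"])
    loss = nn.functional.mse_loss(recon, images)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because the backbone stays frozen and each inverse model reads from a single layer, one such decoder can be trained per layer independently, which is what makes the approach modular.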