Interpretability of Deep Neural Networks (DNNs) is a growing field driven by the study of vision and language models. Yet, some use cases, such as image captioning, and domains such as Deep Reinforcement Learning (DRL), require complex modelling with multiple inputs and outputs, or rely on composed, separate networks. As a consequence, they rarely fit natively into the API of popular interpretability frameworks. We thus present TDHook, an open-source, lightweight, generic interpretability framework based on $\texttt{tensordict}$ and applicable to any $\texttt{torch}$ model. It focuses on handling complex composed models, which can be trained for Computer Vision, Natural Language Processing, Reinforcement Learning or any other domain. The library features ready-to-use methods for attribution and probing, as well as a flexible get-set API for interventions, and aims to bridge the gap between these method classes to make modern interpretability pipelines more accessible. TDHook is designed with minimal dependencies, requiring roughly half as much disk space as $\texttt{transformer\_lens}$, and, in our controlled benchmark, achieves up to a 2$\times$ speed-up over $\texttt{captum}$ when running integrated gradients for multi-target pipelines on both CPU and GPU. Finally, to illustrate the value of our work, we showcase concrete use cases of our library with composed interpretability pipelines in Computer Vision (CV) and Natural Language Processing (NLP), as well as with complex models in DRL.