Interpretability of Deep Neural Networks (DNNs) is a growing field driven by the study of vision and language models. Yet, some use cases, such as image captioning, and domains such as Deep Reinforcement Learning (DRL), require complex modelling with multiple inputs and outputs, or rely on composed, separate networks. As a consequence, they rarely fit natively into the API of popular interpretability frameworks. We thus present TDHook, an open-source, lightweight, generic interpretability framework based on $\texttt{tensordict}$ and applicable to any $\texttt{torch}$ model. It focuses on handling complex composed models, which can be trained for Computer Vision, Natural Language Processing, Reinforcement Learning or any other domain. The library features ready-to-use methods for attribution and probing, as well as a flexible get-set API for interventions, and aims to bridge the gap between these method classes to make modern interpretability pipelines more accessible. TDHook is designed with minimal dependencies, requiring roughly half as much disk space as $\texttt{transformer\_lens}$, and, in our controlled benchmark, achieves up to a 2$\times$ speed-up over $\texttt{captum}$ when running integrated gradients for multi-target pipelines on both CPU and GPU. Finally, to illustrate the value of our work, we showcase concrete use cases of our library with composed interpretability pipelines in Computer Vision (CV) and Natural Language Processing (NLP), as well as with complex models in DRL.