The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them will be important for building more trustworthy AI by helping to identify problems, fix bugs, and improve basic understanding. In particular, "inner" interpretability techniques, which focus on explaining the internal components of DNNs, are well-suited for developing a mechanistic understanding, guiding manual modifications, and reverse engineering solutions. Much recent work has focused on DNN interpretability, and rapid progress has thus far made a thorough systematization of methods difficult. In this survey, we review over 300 works with a focus on inner interpretability tools. We introduce a taxonomy that classifies methods by what part of the network they help to explain (weights, neurons, subnetworks, or latent representations) and whether they are implemented during (intrinsic) or after (post hoc) training. To our knowledge, we are also the first to survey a number of connections between interpretability research and work in adversarial robustness, continual learning, modularity, network compression, and the study of the human visual system. We discuss key challenges and argue that the status quo in interpretability research is largely unproductive. Finally, we highlight the importance of future work that emphasizes diagnostics, debugging, adversaries, and benchmarking in order to make interpretability tools more useful to engineers in practical applications.