The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them will be important for building more trustworthy AI by helping to identify problems, fix bugs, and improve basic understanding. In particular, "inner" interpretability techniques, which focus on explaining the internal components of DNNs, are well-suited for developing a mechanistic understanding, guiding manual modifications, and reverse engineering solutions. Much recent work has focused on DNN interpretability, and rapid progress has thus far made a thorough systematization of methods difficult. In this survey, we review over 300 works with a focus on inner interpretability tools. We introduce a taxonomy that classifies methods by what part of the network they help to explain (weights, neurons, subnetworks, or latent representations) and whether they are implemented during (intrinsic) or after (post hoc) training. To our knowledge, we are also the first to survey a number of connections between interpretability research and work in adversarial robustness, continual learning, modularity, network compression, and studying the human visual system. We discuss key challenges and argue that the status quo in interpretability research is largely unproductive. Finally, we highlight the importance of future work that emphasizes diagnostics, debugging, adversaries, and benchmarking in order to make interpretability tools more useful to engineers in practical applications.