Inner Interpretability is a promising emerging field tasked with uncovering the inner mechanisms of AI systems, though how to develop these mechanistic theories is still much debated. Moreover, recent critiques raise issues that question the field's usefulness for advancing the broader goals of AI. However, it has been overlooked that these issues resemble ones that another field has long grappled with: Cognitive Neuroscience. Here we draw the relevant connections and highlight lessons that can be transferred productively between fields. Based on these, we propose a general conceptual framework and give concrete methodological strategies for building mechanistic explanations in AI Inner Interpretability research. With this conceptual framework, Inner Interpretability can fend off critiques and position itself on a productive path toward explaining AI systems.