Mechanistic interpretability (MI) aims to understand AI models by reverse-engineering the exact algorithms that neural networks learn. Most MI work so far has studied behaviors and capabilities that are trivial and token-aligned. However, most capabilities are not that trivial, which motivates studying the hidden representations inside these networks as the unit of analysis. We review the literature, formalize representations for features and behaviors, highlight their importance and evaluation, and perform some basic exploration in the mechanistic interpretability of representations. Through discussion and exploratory results, we justify our position that studying representations is an important and under-studied field, and that currently established MI methods are not sufficient to understand representations. We therefore urge the research community to work toward new frameworks for studying representations.