As language models are applied to an increasing number of real-world applications, understanding their inner workings has become an important issue in model trust, interpretability, and transparency. In this work we show that representation dissimilarity measures, which are functions that measure the extent to which two models' internal representations differ, can be a valuable tool for gaining insight into the mechanics of language models. Among our insights are: (i) an apparent asymmetry in the internal representations of models using SoLU and GeLU activation functions, (ii) evidence that dissimilarity measures can identify and locate generalization properties of models that are invisible via in-distribution test set performance, and (iii) new evaluations of how language model features vary as width and depth are increased. Our results suggest that dissimilarity measures are a promising set of tools for shedding light on the inner workings of language models.