In recent years, many interpretability methods have been proposed to help interpret the internal states of Transformer models, at different levels of precision and complexity. Here, to analyze encoder-decoder Transformers, we propose a simple, new method: DecoderLens. Inspired by the LogitLens (for decoder-only Transformers), this method allows the decoder to cross-attend to representations of intermediate encoder layers instead of the final encoder output, as is normally done in encoder-decoder models. The method thus maps previously uninterpretable vector representations to human-interpretable sequences of words or symbols. We report results from the DecoderLens applied to models trained on question answering, logical reasoning, speech recognition and machine translation. The DecoderLens reveals several specific subtasks that are solved at low or intermediate layers, shedding new light on the information flow inside the encoder component of this important class of models.
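To make the core idea concrete, the control flow described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the implementation used for the reported experiments: the names `run_encoder`, `decoder_lens`, `toy_layer`, `toy_decode` and `VOCAB` are hypothetical, the "layers" and "decoder" are trivial stand-ins for real Transformer components, and treating the input embeddings as depth 0 is a choice made for this sketch.

```python
from typing import Callable, List

Vector = List[float]
States = List[Vector]  # one vector per source position

Layer = Callable[[States], States]
Decoder = Callable[[States], List[str]]


def run_encoder(layers: List[Layer], embeddings: States) -> List[States]:
    """Return the hidden states after every encoder layer.

    Index 0 holds the input embeddings (an assumption of this sketch);
    index i holds the output of the i-th layer.
    """
    all_states = [embeddings]
    for layer in layers:
        all_states.append(layer(all_states[-1]))
    return all_states


def decoder_lens(layers: List[Layer], decode: Decoder, embeddings: States) -> List[List[str]]:
    """Run the decoder once per encoder depth.

    A standard encoder-decoder model would call `decode` only on the final
    entry; here every intermediate representation is decoded as well.
    """
    return [decode(states) for states in run_encoder(layers, embeddings)]


# Toy stand-ins (hypothetical): each "layer" adds 1.0 to every feature, and
# the "decoder" emits one symbol per source position from the feature mean.
def toy_layer(states: States) -> States:
    return [[x + 1.0 for x in vec] for vec in states]


VOCAB = ["a", "b", "c", "d"]


def toy_decode(states: States) -> List[str]:
    return [VOCAB[min(int(sum(v) / len(v)), len(VOCAB) - 1)] for v in states]


outputs = decoder_lens([toy_layer] * 3, toy_decode, [[0.0, 0.0], [1.0, 1.0]])
for depth, tokens in enumerate(outputs):
    print(depth, tokens)
```

The last entry of `outputs` coincides with what the ordinary model would produce, while the earlier entries show what the same decoder generates when it is given shallower encoder representations. In a real model, `decode` would be the full autoregressive decoder cross-attending to the supplied encoder states.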