Transformer-based language models (LMs) create hidden representations of their inputs at every layer, but use only final-layer representations for prediction. This obscures the internal decision-making process of the model and the utility of its intermediate representations. One way to elucidate this is to cast the hidden representations as final representations, bypassing the transformer computation in between. In this work, we suggest a simple method for such casting, using linear transformations. We show that our approach produces more accurate approximations than the prevailing practice of inspecting hidden representations from all layers in the space of the final layer. Moreover, in the context of language modeling, our method allows "peeking" into early-layer representations of GPT-2 and BERT, showing that LMs often already predict the final output in early layers. We then demonstrate the practical value of our method for recent early-exit strategies: when aiming, for example, to retain 95% accuracy, our approach saves an additional 7.9% of layers for GPT-2 and 5.4% of layers for BERT, on top of the savings of the original approach. Finally, we extend our method to linearly approximate sub-modules, finding that attention is the most tolerant of this change.
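To make the idea of casting a hidden representation as a final one more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the linear map is fit by ordinary least squares (the abstract does not say how the map is obtained), and it uses synthetic arrays as stand-ins for real GPT-2 or BERT hidden states. The names `fit_linear_shortcut`, `h_src`, and `h_final` are hypothetical. The identity baseline stands in for reading an intermediate representation directly in the final-layer space.

```python
import numpy as np


def fit_linear_shortcut(h_src, h_final):
    """Least-squares fit of a matrix A such that h_src @ A approximates h_final.

    h_src:   (n_tokens, d_model) hidden states from an intermediate layer
    h_final: (n_tokens, d_model) hidden states from the final layer
    """
    A, *_ = np.linalg.lstsq(h_src, h_final, rcond=None)
    return A


def mse(pred, target):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((pred - target) ** 2))


# Synthetic stand-ins for hidden states (assumption: no real model is used here).
rng = np.random.default_rng(0)
n_tokens, d_model = 512, 16
h_src = rng.normal(size=(n_tokens, d_model))
hidden_map = rng.normal(size=(d_model, d_model))
h_final = h_src @ hidden_map + 0.01 * rng.normal(size=(n_tokens, d_model))

# Fit on the first 400 tokens, evaluate on the remaining held-out tokens.
A = fit_linear_shortcut(h_src[:400], h_final[:400])
h_test, target = h_src[400:], h_final[400:]

mse_linear = mse(h_test @ A, target)   # linear shortcut to the final layer
mse_identity = mse(h_test, target)     # baseline: use the hidden state as-is

print(f"linear shortcut MSE: {mse_linear:.4f}")
print(f"identity baseline MSE: {mse_identity:.4f}")
```

With real models, `h_src` and `h_final` would be hidden states collected at a chosen intermediate layer and at the last layer for the same token positions; how closely the fitted map tracks the final layer then depends on the model and the layer, which this toy data does not capture.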