The prediction head is a crucial component of Transformer language models. Despite its direct impact on prediction, this component has often been overlooked in analyses of Transformers. In this study, we investigate the inner workings of the prediction head, focusing specifically on its bias parameters. Our experiments with BERT and GPT-2 models reveal that the biases in their word prediction heads play a significant role in the models' ability to reflect word frequency in a corpus, which aligns with the logit adjustment method commonly used in long-tailed learning. We also quantify the effect of controlling the biases in practical auto-regressive text generation scenarios; under a particular setting, more diverse text can be generated without compromising text quality.
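The link between prediction-head biases and word frequency can be illustrated with a toy example. The sketch below is not the procedure used in this study: the three-word vocabulary, its counts, and the `bias_scale` knob are all hypothetical, and the bias is simply assumed to equal the log unigram probability, which is the idealized logit-adjustment view.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores, bias, bias_scale=1.0):
    """Turn context scores plus a (scaled) bias into a distribution.

    bias_scale is a hypothetical control knob: 1.0 keeps the bias,
    0.0 removes it entirely.
    """
    return softmax([s + bias_scale * b for s, b in zip(scores, bias)])


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical toy vocabulary with made-up corpus counts.
counts = {"the": 900, "cat": 90, "axolotl": 10}
total = sum(counts.values())

# Assumption: the bias encodes the log unigram probability of each word.
bias = [math.log(c / total) for c in counts.values()]

# Context scores that express no preference among the words.
scores = [0.0, 0.0, 0.0]

with_bias = predict(scores, bias, bias_scale=1.0)
without_bias = predict(scores, bias, bias_scale=0.0)

for word, p_on, p_off in zip(counts, with_bias, without_bias):
    print(f"{word:8s} with bias: {p_on:.3f}  without bias: {p_off:.3f}")
print(f"entropy with bias: {entropy(with_bias):.3f}, "
      f"without bias: {entropy(without_bias):.3f}")
```

With the bias in place, an uninformative context falls back to the corpus frequencies; with the bias removed, the same context yields a uniform distribution, so rare words become more likely. In a real model the context scores are not uninformative, so the effect of scaling the bias on diversity and quality has to be measured empirically.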