Learned representations at the level of characters, sub-words, words, and sentences have each contributed to advances in understanding different NLP tasks and linguistic phenomena. However, learning textual embeddings is costly because they are tokenization-specific and a separate model must be trained for each level of abstraction. We introduce a novel language representation model that learns to compress to different levels of abstraction at different layers of the same model. We apply the Nonparametric Variational Information Bottleneck (NVIB) to the stacked Transformer self-attention layers in the encoder, which encourages an information-theoretic compression of the representations through the model. We find that the layers within the model correspond to increasing levels of abstraction and that their representations are more linguistically informed. Finally, we show that NVIB compression results in a model that is more robust to adversarial perturbations.