Identifying the words that affect a task's performance more than others is a challenge in natural language processing. Transformer models have recently addressed this issue by incorporating an attention mechanism that assigns greater attention (i.e., relevance) scores to some words than to others. Because of the attention mechanism's high computational cost, transformer models usually have an input-length limitation imposed by hardware constraints. This limitation applies to many transformers, including the well-known Bidirectional Encoder Representations from Transformers (BERT) model. In this paper, we examined BERT's attention assignment mechanism, focusing on two questions: (1) How can attention be employed to reduce input length? (2) How can attention be used as a control mechanism for conditional text generation? We investigated these questions in the context of a text classification task. We found that the attention scores assigned by BERT's early layers are more critical for text classification than those assigned by later layers. We demonstrated that the first layer's attention sums can be used to filter tokens in a given sequence, considerably decreasing the input length while maintaining good test accuracy. We also applied filtering based on a compute-efficient semantic-similarity algorithm and found that retaining approximately 6\% of the original sequence is sufficient to obtain 86.5\% accuracy. Finally, we showed that, using only a small percentage (10\%) of the tokens with high attention scores according to BERT's first layer, we could generate data in a stable manner and that the generated data is indistinguishable from the original.
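The token-filtering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not define how the "attention sums" are computed, so this example assumes that a token's score is the total attention it receives, summed over all heads and all query positions of a single layer. The function names `attention_sums` and `filter_tokens`, the `keep_ratio` parameter, and the toy attention matrix are hypothetical and are not taken from the paper. In practice the matrix would come from BERT's first layer rather than being written by hand.

```python
from typing import List, Sequence


def attention_sums(attn: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Total attention received by each token.

    attn[h][i][j] is the attention weight from query token i to key token j
    in head h. Assumption: the score of token j is the sum over h and i.
    """
    seq_len = len(attn[0])
    sums = [0.0] * seq_len
    for head in attn:
        for row in head:
            for j, weight in enumerate(row):
                sums[j] += weight
    return sums


def filter_tokens(tokens: Sequence[str],
                  attn: Sequence[Sequence[Sequence[float]]],
                  keep_ratio: float = 0.1) -> List[str]:
    """Keep the highest-scoring fraction of tokens, preserving their order."""
    sums = attention_sums(attn)
    k = max(1, int(keep_ratio * len(tokens)))  # always keep at least one token
    # Rank positions by descending score; ties go to the earlier position.
    top = sorted(range(len(tokens)), key=lambda j: (-sums[j], j))[:k]
    return [tokens[j] for j in sorted(top)]


# Toy example: one head, four tokens, each row sums to 1 like a softmax output.
tokens = ["the", "movie", "was", "great"]
attn = [[
    [0.1, 0.4, 0.1, 0.4],
    [0.1, 0.5, 0.1, 0.3],
    [0.2, 0.3, 0.1, 0.4],
    [0.1, 0.3, 0.1, 0.5],
]]

print(filter_tokens(tokens, attn, keep_ratio=0.5))  # ['movie', 'great']
```

Keeping the surviving tokens in their original order is a design choice made here so that the shortened sequence remains readable as text. Whether the paper preserves order is not stated in the abstract.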