Despite being the current de facto standard models for most NLP tasks, transformers are often limited to short sequences because the cost of attention is quadratic in the number of tokens. Several approaches to this issue have been studied: reducing the cost of the self-attention computation, or modeling smaller sequences and combining them through a recurrence mechanism or a new transformer model. In this paper, we propose to take advantage of pre-trained sentence transformers to start from semantically meaningful embeddings of the individual sentences, and then combine them through a small attention layer that scales linearly with the document length. We report the results obtained by this simple architecture on three standard document classification datasets. When compared with the current state-of-the-art models using standard fine-tuning, the studied method obtains competitive results (even if there is no clear best model in this configuration). We also show that the studied architecture obtains better results when the underlying transformers are frozen, a configuration that is useful when complete fine-tuning must be avoided (e.g. when the same frozen transformer is shared by different applications). Finally, two additional experiments are provided to further evaluate the relevance of the studied architecture compared with simpler baselines.
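The idea of combining per-sentence embeddings through a small attention layer can be illustrated with a minimal sketch. The abstract does not specify the exact layer, so the code below assumes a single learned query vector that scores each sentence embedding (one common form of attention pooling). The names `attention_pool` and `query`, the embedding size of 384, and the random arrays standing in for frozen sentence-transformer outputs are all hypothetical. With one query, the cost is one dot product per sentence, so it grows linearly with the number of sentences.

```python
import numpy as np


def attention_pool(sentence_embeddings, query):
    """Pool n sentence embeddings of shape (n, d) into one document vector (d,).

    Assumption: a single query vector scores every sentence, which is an
    illustrative choice and not necessarily the layer used in the paper.
    """
    d = query.shape[0]
    # One score per sentence: O(n * d) work, linear in the number of sentences.
    scores = sentence_embeddings @ query / np.sqrt(d)
    # Numerically stable softmax over the sentences.
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()
    # Weighted average of the sentence embeddings.
    return weights @ sentence_embeddings, weights


rng = np.random.default_rng(0)
# Stand-in for the outputs of a frozen sentence transformer (12 sentences).
embeddings = rng.normal(size=(12, 384))
# Hypothetical learned parameter; random here for illustration only.
query = rng.normal(size=384)

doc_vector, weights = attention_pool(embeddings, query)
```

In such a setup, `doc_vector` would be passed to a small classification head, and only `query` and that head would need training when the sentence encoder is kept frozen.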