Dense vector representations of textual data are crucial in modern NLP. Word and sentence embeddings estimated from raw text are key to achieving state-of-the-art results in various tasks that require semantic understanding. However, obtaining embeddings at the document level is challenging because of computational requirements and a lack of appropriate data. Instead, most approaches fall back on computing document embeddings from sentence representations. Although architectures and models that encode full documents exist, they are generally limited to English and a few other high-resource languages. In this work, we provide a systematic comparison of methods for producing document-level representations from sentences, based on the pre-trained multilingual models LASER, LaBSE, and Sentence BERT. We compare truncation to a maximum number of input tokens, sentence averaging, some simple windowing approaches and, in some cases, new augmented and learnable approaches, on 3 multi- and cross-lingual tasks in 8 languages belonging to 3 different language families. Our task-based extrinsic evaluations show that, independently of the language, a clever combination of sentence embeddings is usually better than encoding the full document as a single unit, even when the latter is possible. We demonstrate that while a simple sentence average is a strong baseline for classification tasks, more complex combinations are necessary for semantic tasks.
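To make two of the compared strategies concrete, the following is a minimal sketch of sentence averaging and of a simple sliding-window variant. It is an illustration under stated assumptions, not the implementation evaluated in this work: the function names `mean_pool` and `windowed_pool`, the `window` and `stride` parameters, and the toy two-dimensional vectors are all hypothetical. In practice the sentence vectors would come from an encoder such as LASER, LaBSE, or Sentence BERT.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty list of equal-length vectors component-wise."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def windowed_pool(
    vectors: Sequence[Sequence[float]], window: int, stride: int
) -> List[Vector]:
    """Average sentence vectors inside sliding windows (hypothetical scheme).

    Returns one pooled vector per window. A final window is added when the
    stride would otherwise leave the last sentences uncovered.
    """
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    if not vectors:
        raise ValueError("need at least one vector")
    last_start = max(len(vectors) - window, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [mean_pool(vectors[s:s + window]) for s in starts]


# Toy stand-ins for sentence embeddings; real ones would come from an encoder.
sentence_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 1.0]]

# Strategy 1: the document embedding is the plain average of all sentences.
doc_average = mean_pool(sentence_vectors)

# Strategy 2: pool within windows first, then average the window vectors.
window_vectors = windowed_pool(sentence_vectors, window=2, stride=2)
doc_windowed = mean_pool(window_vectors)
```

With non-overlapping windows of equal size, the two document vectors coincide. Overlapping or uneven windows weight sentences differently, which is one reason a windowed combination can behave differently from the plain average.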