Prior work on multilingual sentence embedding has shown that models trained efficiently on natural language inference (NLI) data can outperform conventional methods. However, the potential benefits of the recent ``exponential'' growth of language models to billions of parameters have not yet been fully explored. In this paper, we introduce Multilingual Sentence T5 (m-ST5), a larger NLI-based multilingual sentence embedding model that extends Sentence T5, an existing monolingual model. By employing low-rank adaptation (LoRA), we scaled the model to 5.7 billion parameters. We conducted experiments to evaluate sentence embedding performance and verified that our method outperforms the prior NLI-based approach. We also confirmed a positive correlation between model size and performance. Notably, languages with fewer resources and languages less linguistically similar to English benefited more from the increase in parameters. Our model is available at https://huggingface.co/pkshatech/m-ST5.