Learning better sentence embeddings leads to improved performance on natural language understanding tasks, including semantic textual similarity (STS) and natural language inference (NLI). Because prior studies leverage large-scale labeled NLI datasets for fine-tuning masked language models to yield sentence embeddings, task performance for languages other than English is often left behind. In this study, we directly compare two data augmentation techniques as potential solutions for monolingual STS: (a) cross-lingual transfer, which uses English resources alone as training data and yields non-English sentence embeddings through zero-shot inference, and (b) machine translation, which converts English data into pseudo non-English training data in advance. In our experiments on monolingual STS in Japanese and Korean, we find that the two data augmentation techniques perform on par. Rather, we find that the Wikipedia domain is superior to the NLI domain for these languages, in contrast to prior studies that focused on NLI as training data. Combining our findings, we demonstrate that cross-lingual transfer of Wikipedia data improves performance, and that native Wikipedia data can further improve performance for monolingual STS.