Decoder-based large language models (LLMs) have shown high performance on many natural language processing tasks. This also holds for sentence embedding learning, where the decoder-based model PromptEOL has achieved the best performance on semantic textual similarity (STS) tasks. However, PromptEOL relies heavily on fine-tuning with a manually annotated natural language inference (NLI) dataset. We aim to improve sentence embeddings learned in an unsupervised setting by automatically generating an NLI dataset with an LLM and using it to fine-tune PromptEOL. In experiments on STS tasks, the proposed method achieved an average Spearman's rank correlation coefficient of 82.21 with respect to human evaluation, outperforming existing methods that do not use large, manually annotated datasets.
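To make the reported metric concrete, the following is a minimal sketch of how an STS score of this kind is commonly computed: each sentence pair is scored by the cosine similarity of its two embeddings, and those scores are compared with human ratings using Spearman's rank correlation, scaled by 100. The function names (`cosine`, `average_ranks`, `spearman`, `sts_score`) and the input layout are illustrative assumptions, not code from PromptEOL or the proposed method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def sts_score(embedding_pairs, human_scores):
    """Spearman correlation (x100) between cosine similarities and human ratings.

    embedding_pairs: list of (vector, vector) tuples, one per sentence pair
    (hypothetical layout). human_scores: gold similarity ratings, same order.
    """
    similarities = [cosine(u, v) for u, v in embedding_pairs]
    return 100.0 * spearman(similarities, human_scores)
```

Because Spearman's coefficient depends only on rank order, the embeddings need to order the sentence pairs the way annotators do; the absolute cosine values do not have to match the human rating scale.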