Because no large collection of high-quality sentence pairs labeled with textual similarity scores exists, existing approaches to Semantic Textual Similarity (STS) mostly rely on unsupervised techniques or on training signals that are only partially correlated with textual similarity, e.g., NLI-based datasets. To address this issue, we propose measuring text similarity via GPT-annotated data (Sim-GPT for short). The core idea of Sim-GPT is to generate data with STS labels using GPT-4 and then train an STS model on that data. The Sim-GPT framework uses LLMs to provide a substantial amount of reliable annotated data, filling the gap left by the lack of training signals for STS. Sim-GPT is trained on a dataset that is generated once, using BERT or RoBERTa as the backbone, which offers long-term savings in cost and speed compared to repeatedly invoking LLMs for each sentence pair. Trained on 371K examples from GPT-4, Sim-GPT achieves state-of-the-art (SOTA) performance on the seven widely used STS benchmarks: +0.99 over supervised SimCSE and +0.42 over the previous SOTA model, PromCSE. To encourage further progress in the field, we release both the models and the 371K GPT-4-annotated examples. Code, models, and annotated data are available at: https://github.com/ShuheWang1998/Sim-GPT.
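The one-time annotation step described above can be sketched as follows. This is a minimal illustration, not the released Sim-GPT code: the prompt wording, the 0 to 5 score range, and the names `LabeledPair`, `build_prompt`, `parse_score`, and `annotate_once` are all assumptions made for this example, and `ask_llm` is a hypothetical callable standing in for a GPT-4 request.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class LabeledPair:
    """One training example: a sentence pair plus its similarity label."""
    sentence_a: str
    sentence_b: str
    score: float  # assumed 0-5 scale; the paper's actual label scheme may differ


def build_prompt(sentence_a: str, sentence_b: str) -> str:
    # Hypothetical prompt wording, written for this sketch only.
    return (
        "Rate the semantic similarity of the two sentences "
        "on a scale from 0 to 5. Reply with the number only.\n"
        f"Sentence A: {sentence_a}\n"
        f"Sentence B: {sentence_b}"
    )


def parse_score(reply: str, low: float = 0.0, high: float = 5.0) -> float:
    """Extract the first number in the reply and clamp it to [low, high]."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no numeric score found in reply: {reply!r}")
    return min(high, max(low, float(match.group())))


def annotate_once(
    pairs: Iterable[Tuple[str, str]],
    ask_llm: Callable[[str], str],
) -> List[LabeledPair]:
    """Query the annotator exactly once per pair and keep the labels.

    The returned list plays the role of the fixed training set: a smaller
    model can be trained on it, so the annotator is not called at inference.
    """
    return [
        LabeledPair(a, b, parse_score(ask_llm(build_prompt(a, b))))
        for a, b in pairs
    ]
```

Passing the annotator in as a plain callable keeps the sketch independent of any particular API client, and lets a stub replace it when checking the parsing and clamping logic. The resulting records would then serve as supervised targets for a BERT or RoBERTa model, which is the part that replaces per-pair LLM calls at inference time.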