Because no large collection of high-quality sentence pairs labeled with textual similarity scores exists, existing approaches to Semantic Textual Similarity (STS) mostly rely on unsupervised techniques or on training signals that are only partially correlated with textual similarity, e.g., NLI-based datasets. To tackle this issue, we propose measuring text similarity via GPT-annotated data (Sim-GPT for short). The core idea of Sim-GPT is to generate data with STS labels using GPT-4 and then train an STS model on that data. The Sim-GPT framework uses LLMs to provide a substantial amount of reliable annotated data, filling the gap left by the lack of training signals for STS. Sim-GPT is trained on a dataset that is generated once, with BERT or RoBERTa as the backbone, which offers long-term savings in cost and speed compared with repeatedly invoking LLMs for each sentence pair. Trained on 371K examples from GPT-4, Sim-GPT yields state-of-the-art (SOTA) performance on the seven widely used STS benchmarks: +0.99 over supervised SimCSE and +0.42 over the current SOTA PromCSE model. To encourage further advances in the field, we release both the models and the 371K annotated examples from GPT-4. Code, models, and annotated data are available at: https://github.com/ShuheWang1998/Sim-GPT.
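The abstract does not say how a trained model is applied to a sentence pair or how the benchmark numbers are computed. As a rough illustration only, assuming the common STS convention (cosine similarity between two sentence embeddings, compared with gold scores by Spearman rank correlation), the evaluation step could be sketched as follows. The embeddings and gold scores are made-up toy values, and every function name is hypothetical; none of this is taken from the Sim-GPT code release.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0.0 or std_y == 0.0:
        return 0.0
    return cov / (std_x * std_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical embeddings for three sentence pairs (toy values, not model output).
    pairs = [
        ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
        ([1.0, 0.0, 0.0], [0.5, 0.5, 0.0]),
        ([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]),
    ]
    gold_scores = [4.8, 3.0, 0.2]  # Hypothetical similarity labels.
    predicted = [cosine_similarity(u, v) for u, v in pairs]
    print("Spearman correlation:", spearman(predicted, gold_scores))
```

Rank correlation only checks whether the predicted similarities order the pairs the same way as the labels do, so the predictions and the gold scores do not need to share a scale.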