Embedding models are crucial for many natural language processing tasks, but their performance can be limited by factors such as a restricted vocabulary, lack of context, and grammatical errors. This paper proposes improving embedding performance by using a large language model (LLM) to enrich and rewrite input text before it is embedded. ChatGPT 3.5 is used to add context, correct inaccuracies, and incorporate metadata, with the aim of enhancing the utility and accuracy of embedding models. The approach is evaluated on three datasets: Banking77Classification, TwitterSemEval 2015, and Amazon Counterfactual Classification. On TwitterSemEval 2015, it improves significantly over the baseline model: the best-performing prompt achieves a score of 85.34, compared to the previous best of 81.52 on the Massive Text Embedding Benchmark (MTEB) Leaderboard. Performance on the other two datasets was less impressive, which highlights the importance of considering domain-specific characteristics. The findings suggest that LLM-based text enrichment is a promising way to improve embedding performance, particularly in certain domains, and that it can mitigate several of the limitations of the embedding process.
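The enrich-then-embed pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the prompt wording, the function names (enrich_text, toy_embed, embed_with_enrichment), and the stub fake_llm are all hypothetical. The stub stands in for a ChatGPT 3.5 call, and the hashed bag-of-words embedder stands in for a real embedding model, so that the example runs without network access.

```python
import hashlib
import math
from typing import Callable, List

# Hypothetical prompt; the prompts actually used in the paper are not given here.
PROMPT_TEMPLATE = (
    "Rewrite the following text so it is grammatical and self-contained. "
    "Add brief clarifying context, but do not change its meaning.\n\n"
    "Text: {text}"
)


def enrich_text(text: str, llm: Callable[[str], str]) -> str:
    """Ask the LLM to rewrite the text; fall back to the original on an empty reply."""
    rewritten = llm(PROMPT_TEMPLATE.format(text=text)).strip()
    return rewritten or text


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Stand-in embedder (assumption): hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Map each token to a fixed bucket with a stable hash.
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def embed_with_enrichment(
    text: str,
    llm: Callable[[str], str],
    embed: Callable[[str], List[float]] = toy_embed,
) -> List[float]:
    """Enrich the raw text first, then embed the enriched version."""
    return embed(enrich_text(text, llm))


def fake_llm(prompt: str) -> str:
    """Hypothetical stub in place of a ChatGPT 3.5 call: expands two abbreviations."""
    original = prompt.split("Text: ", 1)[1]
    return original.replace("acct", "bank account").replace("pls", "please")


vector = embed_with_enrichment("pls unblock my acct", fake_llm)
```

In practice, fake_llm would be replaced by a call to a hosted LLM and toy_embed by the embedding model under evaluation. Passing both as callables keeps the enrichment step independent of either provider.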