Jina Embeddings is a set of high-performance sentence embedding models that translate textual inputs into numerical representations capturing the semantics of the text. The models excel in applications such as dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of high-quality pairwise and triplet datasets. It underlines the crucial role of data cleaning in dataset preparation, gives in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Text Embedding Benchmark (MTEB). To increase the models' awareness of negations, we constructed a novel training and evaluation dataset of negated and non-negated statements, which we make publicly available to the community.