Jina Embeddings is a set of high-performance sentence embedding models that translate textual inputs into numerical representations capturing the semantics of the text. These models excel in applications such as dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of high-quality pairwise and triplet datasets. It underlines the crucial role of data cleaning in dataset preparation, offers in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Text Embedding Benchmark (MTEB). Furthermore, to increase the models' awareness of grammatical negation, we construct a novel training and evaluation dataset of negated and non-negated statements, which we make publicly available to the community.
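To make the notions of semantic textual similarity and triplet data concrete, the following is a minimal sketch of how a triplet could be checked once embeddings are available. It assumes that similarity is measured with cosine similarity and that a triplet consists of an anchor, a semantically matching statement, and a negated statement. The function names and the toy three-dimensional vectors are hypothetical and chosen for illustration only; they are not outputs of Jina Embeddings and do not reproduce the training or evaluation procedure of the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_is_ranked_correctly(anchor, positive, negative):
    """True if the anchor is closer to the positive than to the negative."""
    return cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative)


# Hypothetical toy embeddings (illustration only, not real model outputs).
anchor = [0.9, 0.1, 0.3]    # e.g. "The flight was on time."
positive = [0.8, 0.2, 0.4]  # e.g. "The plane arrived as scheduled."
negative = [0.1, 0.9, 0.2]  # e.g. "The flight was not on time."

print(triplet_is_ranked_correctly(anchor, positive, negative))
```

Under these assumptions, a model with good awareness of negation would place the negated statement farther from the anchor than the matching statement, and the fraction of triplets ranked correctly could serve as a simple accuracy measure.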