We introduce C-Pack, a package of resources that significantly advances the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. At the time of release, our models outperform all prior Chinese text embeddings on C-MTEB by up to +10%. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.