We introduce Gemini Embedding 2, a natively multimodal embedding model that embeds video, audio, image, and text inputs in a unified representation space. Building on the multimodal capabilities of Gemini, it produces embeddings for arbitrary interleaved combinations of these modalities, and these embeddings generalize well across a wide variety of tasks. Using large-scale contrastive learning in a multi-task, multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks covering unimodal, cross-modal, and multimodal retrieval. The model scores 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual, and 84.0 on MTEB Code, surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as retrieval-augmented generation (RAG), recommendation, and search. Furthermore, its robust zero-shot performance across distinct fields, from astronomy and bioscience to fine arts and the culinary arts, makes it a reliable out-of-the-box representation even for specialized domains.
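For readers unfamiliar with the retrieval metrics quoted above, the following is a minimal sketch of how R@1 and NDCG@10 can be computed once queries and candidates live in one embedding space. It is not the evaluation code behind the reported numbers: the toy vectors, the function names (`rank_candidates`, `recall_at_1`, `ndcg_at_k`), the use of cosine similarity, and the linear-gain form of DCG are all assumptions made for illustration, and individual benchmarks may define their protocols differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query, candidates):
    """Return candidate indices sorted by decreasing similarity to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

def recall_at_1(ranking, relevant):
    """1.0 if the top-ranked candidate is relevant, else 0.0."""
    return 1.0 if ranking[0] in relevant else 0.0

def ndcg_at_k(ranking, gains, k=10):
    """NDCG@k with linear gains; `gains` maps candidate index to relevance."""
    dcg = sum(gains.get(idx, 0.0) / math.log2(pos + 2)
              for pos, idx in enumerate(ranking[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical 2-d embeddings: one query (say, a caption) and three
# candidates (say, images) that share the same representation space.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]

ranking = rank_candidates(query, candidates)
r_at_1 = recall_at_1(ranking, relevant={1})
ndcg = ndcg_at_k(ranking, gains={2: 1.0}, k=10)
```

Averaging these per-query values over a benchmark's query set, and multiplying by 100, gives scores on the same scale as those quoted in the abstract.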