Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Madhuri Shanbhogue, Zhe Li, Shanfeng Zhang, Gustavo Hernández Ábrego, Shih-Cheng Huang, Aashi Jain, Daniel Salz, Sonam Goenka, Chaitra Hegde, Ji Ma, Feiyang Chen, Jiaxing Wu, Tanmaya Dabral, Babak Samari, Kevin Poulet, Daniel Cer, Kaifeng Chen, Paul Suganathan, Hui Hui, Jovan Andonov, Philippe Schlattner, Jay Han, Iftekhar Naim, Wing Lowe, Vladimir Pchelin, Albert Yang, Yi-Ting Chen, Zhongli Ding, Grace Zhang, Georg Heigold, Yichang Chen, Antoine Reveillon, Brendan Mccloskey, Wenlei Zhou, Dahun Kim, Rui Meng, Emma Wang, Jack Zheng, Halley Fede, Zhen Yang, Keegan Mosley, Brian Potetz, Sahil Dua, Henrique Schechter Vera, Shen Gao, Hesen Zhang, Andreas Hess, Hengxuan Ying, Alberto Montes, Karan Gill, Min Choi, Sebastian Russo, Anja Hauth, Jinhyuk Lee, Michael Boratko, Megan Barnes, Vikram Rao, Claudiu Musat, Cyril Allauzen, Ehsan Variani, Shankar Kumar, Tom Bagby, Junyi Jiao, Yang Gu, Tengxin Li, Ayush Agrawal, Roberto Santana, Dev Nath, Stephen Karukas, Shuoxuan Han, Lucia Loher, Alice Twu, Nidhi Vyas, Siddharth Bhai, Frank Palma Gomez, Wangyuan Zhang, Chaoren Liu, Jizheng Yang, Steve Qiu, Shijie Zhang, Sujay Kulkarni, Sascha Rothe, Sean Nakamoto, Raphael Hoffmann, Zach Gleicher, Yunhsuan Sung, Qin Yin, Tom Duerig, Mojtaba Seyedhosseini

We introduce Gemini Embedding 2, a native multimodal embedding model that embeds video, audio, image, and text in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary combinations of interleaved inputs across all these modalities, and these embeddings generalize well across a wide variety of tasks. Applying large-scale contrastive learning in a multi-task, multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks covering unimodal, cross-modal, and multimodal retrieval. Our embedding model performs strongly across a variety of tasks (62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual, and 84.0 on MTEB Code), surpassing specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation, and search. Furthermore, its robust zero-shot performance across distinct fields, from astronomy and bioscience to fine arts and the culinary arts, establishes it as a highly reliable, out-of-the-box representation even for specialized domains.
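The abstract reports R@1 and NDCG@10 for retrieval in a shared embedding space. The sketch below illustrates how such metrics are commonly computed once queries and candidates are embedded in the same space. It is a minimal illustration, not the authors' evaluation code: the vectors are hand-made toy values standing in for text-query and image embeddings (they are not model outputs), the function names are hypothetical, and binary relevance with cosine similarity is an assumption that individual benchmark protocols may not share.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_items(query, items):
    """Return item indices sorted by decreasing similarity to the query."""
    scores = [cosine(query, item) for item in items]
    return sorted(range(len(items)), key=lambda i: scores[i], reverse=True)

def recall_at_k(rankings, relevant, k=1):
    """Fraction of queries with at least one relevant item in the top k."""
    hits = sum(
        1 for ranking, rel in zip(rankings, relevant)
        if any(i in rel for i in ranking[:k])
    )
    return hits / len(rankings)

def ndcg_at_k(ranking, relevant, k=10):
    """NDCG@k for a single query, assuming binary relevance labels."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, i in enumerate(ranking[:k]) if i in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

# Toy "text query" and "image" embeddings in one shared 3-d space (hypothetical).
text_queries = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.3], [0.1, 0.0, 1.0]]
image_items = [
    [0.9, 0.2, 0.1],  # relevant to query 0
    [0.1, 0.8, 0.3],  # relevant to query 1
    [0.7, 0.1, 0.6],  # relevant to query 2
    [0.0, 0.1, 0.9],  # distractor that outranks item 2 for query 2
]
relevant = [{0}, {1}, {2}]

rankings = [rank_items(q, image_items) for q in text_queries]
r_at_1 = recall_at_k(rankings, relevant, k=1)
mean_ndcg = sum(ndcg_at_k(r, rel) for r, rel in zip(rankings, relevant)) / len(rankings)
print(f"R@1 = {r_at_1:.3f}, mean NDCG@10 = {mean_ndcg:.3f}")
```

In this toy setup, two of the three queries retrieve their relevant item first, and the third retrieves it second, so R@1 penalizes the miss fully while NDCG@10 gives it partial credit for the near-top position.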

