Granite Embedding Multilingual R2 Models

Parul Awasthy, Aashka Trivedi, Yushu Yang, Ken Barker, Yulong Li, Bhavani Iyer, Martin Franz, Juergen Bross, Meet Doshi, Vignesh P, Vishwajeet Kumar, Todd Ward, Abraham Daniels, Madison Lee, Luis Lastras, Jaydeep Sen, Radu Florian

We introduce the multilingual Granite Embedding R2 models, a family of encoder-based embedding models for enterprise-scale dense retrieval across 200+ languages. Extending our English-focused R2 release, these models add enhanced support for 52 languages and programming code, a 32,768-token context window (a 64x expansion over R1), and state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets. The release consists of two bi-encoder models based on the ModernBERT architecture with an expanded multilingual vocabulary: a 311M-parameter full-size model, and a 97M-parameter compact model, built via model pruning and vocabulary selection, that achieves the highest retrieval score of any open multilingual embedding model under 100M parameters. The full-size model also supports Matryoshka Representation Learning for flexible embedding dimensionality. Both models are trained on enterprise-appropriate data with governance oversight and are released under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite. The release is designed to support responsible use and to enable unrestricted research and enterprise adoption.
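
The "flexible embedding dimensionality" that Matryoshka Representation Learning provides is usually consumed by keeping only a leading prefix of each embedding and re-normalizing it before computing similarities. The following is a minimal sketch of that general pattern for bi-encoder retrieval, not the authors' implementation: the function names (`truncate_embedding`, `cosine`, `rank`) are hypothetical, the four-dimensional vectors are toy values rather than model outputs, and the abstract does not state which truncated dimensions the full-size model supports.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components of `vec` and rescale to unit L2 norm.

    Assumption: the leading coordinates of a Matryoshka-trained embedding
    remain a usable representation on their own.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]


def cosine(a, b):
    """Dot product of two unit-length vectors, which equals their cosine."""
    return sum(x * y for x, y in zip(a, b))


def rank(query, docs, dim):
    """Return document indices sorted by similarity to the query at `dim`."""
    q = truncate_embedding(query, dim)
    scored = [
        (cosine(q, truncate_embedding(d, dim)), i) for i, d in enumerate(docs)
    ]
    return [i for _, i in sorted(scored, reverse=True)]


# Toy vectors standing in for query and document embeddings (hypothetical).
query = [0.9, 0.1, 0.3, 0.2]
docs = [
    [0.8, 0.2, 0.1, 0.4],  # points in roughly the same direction as the query
    [0.1, 0.9, 0.2, 0.1],  # points in a different direction
]

full_ranking = rank(query, docs, dim=4)
short_ranking = rank(query, docs, dim=2)
```

In this toy case the ranking is the same at both dimensionalities. In practice the shorter prefix trades some retrieval quality for smaller indexes and faster search, and the size of that trade-off would have to be measured on the actual model.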

