We introduce the multilingual Granite Embedding R2 models, a family of encoder-based embedding models for enterprise-scale dense retrieval across 200+ languages. Extending our English-focused R2 release, these models add enhanced support for 52 languages and programming code, a 32,768-token context window (a 64x expansion over R1), and state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets. The release consists of two bi-encoder models based on the ModernBERT architecture with an expanded multilingual vocabulary: a 311M-parameter full-size model and a 97M-parameter compact model. The compact model is built via model pruning and vocabulary selection, and it achieves the highest retrieval score of any open multilingual embedding model under 100M parameters. The full-size model also supports Matryoshka Representation Learning for flexible embedding dimensionality. Both models are trained on enterprise-appropriate data with governance oversight and are released under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite to support responsible use and enable unrestricted research and enterprise adoption.
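As a rough illustration of what flexible embedding dimensionality means in practice, Matryoshka-style embeddings are typically used by keeping only the first few coordinates of a vector and re-normalizing before computing similarity. The sketch below shows that general pattern using only the Python standard library. It does not load or call the released models: the vectors are random stand-ins, and the dimension values (`FULL_DIM = 768`, `256`, `64`) and helper names are hypothetical choices made for illustration, not values taken from the release.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates, then re-normalize to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    return l2_normalize(vec[:dim])


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Stand-in vectors: random values, NOT real model outputs.
rng = random.Random(0)
FULL_DIM = 768  # hypothetical full dimensionality, chosen for illustration
query = l2_normalize([rng.gauss(0, 1) for _ in range(FULL_DIM)])
doc = l2_normalize([rng.gauss(0, 1) for _ in range(FULL_DIM)])

# Compare the same pair of vectors at several truncated sizes.
for dim in (FULL_DIM, 256, 64):
    q = truncate_embedding(query, dim)
    d = truncate_embedding(doc, dim)
    print(dim, round(cosine(q, d), 4))
```

With random vectors the printed similarities carry no meaning; the point is only the truncate-then-normalize step. With a model trained for Matryoshka representations, the shorter prefixes are intended to retain much of the retrieval quality at lower storage and compute cost.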