We introduce the multilingual Granite Embedding R2 models, a family of encoder-based embedding models for enterprise-scale dense retrieval across 200+ languages. Extending our English-focused R2 release, these models add enhanced support for 52 languages and programming code, a 32,768-token context window (a 64x expansion over R1), and state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets. The release consists of two bi-encoder models based on the ModernBERT architecture with an expanded multilingual vocabulary: a 311M-parameter full-size model and a 97M-parameter compact model. The compact model is built via model pruning and vocabulary selection, and it achieves the highest retrieval score of any open multilingual embedding model under 100M parameters. The full-size model also supports Matryoshka Representation Learning for flexible embedding dimensionality. Both models are trained on enterprise-appropriate data with governance oversight and are released under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite, to support responsible use and enable unrestricted research and enterprise adoption.
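To illustrate what flexible embedding dimensionality from Matryoshka Representation Learning typically means in practice, the following is a minimal sketch and not the released models' API. It assumes that a Matryoshka-trained embedding can be shortened by keeping its leading components and re-normalizing. The 768-dimensional width, the 128-dimensional target, the helper names `truncate_embeddings` and `cosine_scores`, and the random stand-in vectors are all hypothetical choices made for illustration.

```python
import numpy as np


def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row, then L2-normalize each row."""
    if not 0 < dim <= embeddings.shape[1]:
        raise ValueError(f"dim must be in 1..{embeddings.shape[1]}, got {dim}")
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return truncated / np.clip(norms, 1e-12, None)


def cosine_scores(queries: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Cosine similarity matrix for inputs whose rows are already unit-length."""
    return queries @ docs.T


# Random stand-ins for model output; the width of 768 is an assumption.
rng = np.random.default_rng(0)
query_full = rng.normal(size=(2, 768))
doc_full = rng.normal(size=(5, 768))

# Shorten both sides to the same hypothetical target width before scoring.
query_small = truncate_embeddings(query_full, 128)
doc_small = truncate_embeddings(doc_full, 128)
scores = cosine_scores(query_small, doc_small)  # one row per query, one column per document
```

Shorter vectors reduce index size and scoring cost, usually at some cost in retrieval quality; which widths the full-size model supports is not stated here and would need to be checked against the model card.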