Granite Embedding Multilingual R2 Models

Parul Awasthy, Aashka Trivedi, Yushu Yang, Ken Barker, Yulong Li, Bhavani Iyer, Martin Franz, Meet Doshi, Riyaz Bhat, Vignesh P, Vishwajeet Kumar, Todd Ward, Abraham Daniels, Rudra Murthy, Madison Lee, Luis Lastras, Jaydeep Sen, Radu Florian

We introduce the multilingual Granite Embedding R2 models, a family of encoder-based embedding models for enterprise-scale dense retrieval across 200+ languages. Extending our English-focused R2 release, these models add enhanced support for 52 languages and programming code, a 32,768-token context window (a 64x expansion over R1), and state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets. The release consists of two bi-encoder models based on the ModernBERT architecture with an expanded multilingual vocabulary: a 311M-parameter full-size model, and a 97M-parameter compact model built via model pruning and vocabulary selection that achieves the highest retrieval score of any open multilingual embedding model under 100M parameters. The full-size model also supports Matryoshka Representation Learning for flexible embedding dimensionality. Both models are trained on enterprise-appropriate data with governance oversight and are released under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite, a release designed to support responsible use and enable unrestricted research and enterprise adoption.
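As a rough illustration of what flexible embedding dimensionality means in practice, the sketch below shows the usual way Matryoshka-style embeddings are shortened: keep a prefix of the vector and rescale it to unit length before computing cosine similarity. The function names, the toy 8-dimensional vectors, and the chosen prefix length are all hypothetical; the abstract does not state which dimensions the full-size model supports or how the embeddings are produced, so this is not the released API.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return list(head)
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two unit-norm vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for real model outputs (values are made up).
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, -0.1, 0.02]
doc = [0.8, 0.2, 0.25, -0.1, 0.0, 0.1, -0.05, 0.01]

# Score with the full vectors, then with a 4-dimensional prefix.
full_score = cosine(truncate_and_normalize(query, 8), truncate_and_normalize(doc, 8))
short_score = cosine(truncate_and_normalize(query, 4), truncate_and_normalize(doc, 4))
print(f"full: {full_score:.4f}  truncated: {short_score:.4f}")
```

Shorter prefixes reduce index size and similarity-computation cost, typically at some cost in retrieval quality; how large that cost is for these models would have to be checked against the reported results rather than inferred from this sketch.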

