Nemotron ColEmbed V2: Top-Performing Late Interaction embedding models for Visual Document Retrieval

Gabriel de Souza P. Moreira, Ronay Ak, Mengyao Xu, Oliver Holworthy, Benedikt Schifferer, Zhiding Yu, Yauhen Babakhin, Radek Osmulski, Jiarui Cai, Ryan Chesler, Bo Liu, Even Oldridge

Retrieval-Augmented Generation (RAG) systems have become popular for generative applications, grounding language models by injecting external knowledge. Companies are trying to leverage their large catalogs of documents (e.g., PDFs, presentation slides) in such RAG pipelines, whose first step is the retrieval component. Dense retrieval is a popular approach, in which embedding models map the user query to a dense representation that lies close to the embeddings of relevant content. More recently, embedding models based on vision-language models (VLMs) have become popular for visual document retrieval, as they preserve visual information and simplify the indexing pipeline compared to OCR text extraction. Motivated by the growing demand for visual document retrieval, we introduce Nemotron ColEmbed V2, a family of models that achieve state-of-the-art performance on the ViDoRe benchmarks. We release three variants, with 3B, 4B, and 8B parameters, based on pre-trained VLMs: NVIDIA Eagle 2 with a Llama 3.2 3B backbone, Qwen3-VL-4B-Instruct, and Qwen3-VL-8B-Instruct, respectively. The 8B model ranks first on the ViDoRe V3 leaderboard as of February 3, 2026, achieving an average NDCG@10 of 63.42. We describe the main techniques used across data processing, training, and post-training (such as cluster-based sampling, hard-negative mining, bidirectional attention, late interaction, and model merging) that helped us build our top-performing models. We also discuss the compute and storage engineering challenges posed by the late interaction mechanism and present experiments on balancing accuracy and storage with lower-dimensional embeddings.
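To make the late interaction mechanism and its storage cost concrete, the following is a minimal sketch of ColBERT-style MaxSim scoring, the usual form of late interaction. The abstract does not spell out the scoring function, so the formula, the helper names (`maxsim_score`, `storage_bytes`), and all sizes (8 query tokens, 256 page tokens, 128 dimensions, 2 bytes per value) are illustrative assumptions rather than the models' actual configuration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction score (assumed ColBERT-style MaxSim).

    For every query token embedding, take its best cosine similarity over
    all document token embeddings, then sum those maxima.
    """
    q = l2_normalize(query_tokens)   # shape: (num_query_tokens, dim)
    d = l2_normalize(doc_tokens)     # shape: (num_doc_tokens, dim)
    sim = q @ d.T                    # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())


def storage_bytes(num_tokens, dim, bytes_per_value=2):
    """Index size of one multi-vector page embedding (2 bytes assumes fp16)."""
    return num_tokens * dim * bytes_per_value


# Hypothetical data: random vectors stand in for model outputs.
rng = np.random.default_rng(0)
query = rng.normal(size=(8, 128))
page_a = rng.normal(size=(256, 128))                        # unrelated page
page_b = np.vstack([query, rng.normal(size=(248, 128))])    # contains the query tokens

scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because a page is stored as one vector per token rather than a single vector, index size grows with `num_tokens * dim`, which is why reducing the embedding dimension trades accuracy against storage: under these assumed sizes, halving `dim` halves `storage_bytes`.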
