Retrieval-Augmented Generation (RAG) systems are widely used in generative applications, where they ground language models by injecting external knowledge. Companies are trying to leverage their large document catalogs (e.g., PDFs, presentation slides) in such RAG pipelines, whose first step is the retrieval component. Dense retrieval is a popular approach, in which embedding models produce a dense representation of the user query that lies close to the embeddings of relevant content. More recently, embedding models based on vision-language models (VLMs) have become popular for visual document retrieval, as they preserve visual information and simplify the indexing pipeline compared to OCR-based text extraction. Motivated by the growing demand for visual document retrieval, we introduce Nemotron ColEmbed V2, a family of models that achieve state-of-the-art performance on the ViDoRe benchmarks. We release three variants - with 3B, 4B, and 8B parameters - based on pre-trained VLMs: NVIDIA Eagle 2 with a Llama 3.2 3B backbone, Qwen3-VL-4B-Instruct, and Qwen3-VL-8B-Instruct, respectively. The 8B model ranks first on the ViDoRe V3 leaderboard as of February 3, 2026, achieving an average NDCG@10 of 63.42. We describe the main techniques used across data processing, training, and post-training - such as cluster-based sampling, hard-negative mining, bidirectional attention, late interaction, and model merging - that helped us build our top-performing models. We also discuss the compute and storage engineering challenges posed by the late interaction mechanism and present experiments on balancing accuracy and storage with lower-dimensional embeddings.
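To make the late interaction mechanism and its storage cost concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the ColBERT-style MaxSim scoring that "late interaction" commonly refers to: each query token is matched to its most similar document token, and the maxima are summed. The function names (`maxsim_score`, `index_bytes`) and all token counts, dimensions, and byte widths are hypothetical, chosen only for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(vec: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Assumed ColBERT-style late-interaction score.

    For each query token, take the maximum cosine similarity over all
    document tokens, then sum these maxima over the query tokens.
    """
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


def index_bytes(num_pages: int, tokens_per_page: int, dim: int,
                bytes_per_value: int = 2) -> int:
    """Storage for a multi-vector index: one `dim`-sized vector per token.

    All arguments are hypothetical inputs; bytes_per_value=2 assumes
    16-bit floats.
    """
    return num_pages * tokens_per_page * dim * bytes_per_value


# Toy example: two query tokens scored against two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 2.0]]  # has a close match for each query token
page_b = [[1.0, 1.0]]              # a single token that partially matches both

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)

# Halving the embedding dimension halves index size in this simple model.
full = index_bytes(num_pages=1000, tokens_per_page=100, dim=128)
half = index_bytes(num_pages=1000, tokens_per_page=100, dim=64)
```

Because a multi-vector index stores one vector per token rather than one per page, its size grows with both the token count and the embedding dimension. Under this simple model, that is why reducing the dimension trades some accuracy for storage.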