Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval

Late-interaction vision-language retrievers represent each document page as many visual token embeddings and score it against a query with MaxSim. In systems such as ColPali, ColQwen, ColNomic, and Nemotron ColEmbed, the document embeddings are produced without seeing the query, so the same page is represented identically for a table lookup, a chart question, and a layout-sensitive evidence request. We introduce \textbf{Argus}, a family of query-conditioned late-interaction retrievers built on Qwen3.5-VL. Argus adds a region-aware Mixture-of-Experts (MoE) module: the query encoder produces both retrieval embeddings and a compact context vector, the document page is pooled into spatial regions, and a query-aware router selects latent experts per region before MaxSim. The output remains a multi-vector index compatible with ColPali-style retrieval, but the document representation now depends on the query (i.e., $\mathbf{D}(q)$). All Argus models use a 1024-dimensional retrieval head, compared with the 2560-dimensional and 4096-dimensional heads of recent state-of-the-art systems, and are trained on roughly 9\% of the available public supervision rather than the full pool. The 9B model reaches \textbf{92.67} NDCG@5 on ViDoRe V1 and \textbf{86.0} NDCG@5 on the combined V1+V2 leaderboard, the highest reported value for an open late-interaction model on the combined leaderboard. Wrapped in a Qwen3.6-27B agentic retrieval pipeline on ViDoRe V3, Argus-9B raises its NDCG@10 from 60.28 to \textbf{64.80} on the public tasks, showing that the same retriever serves both as a strong standalone system and as a search primitive for iterative LLM agents.
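To make the scoring path concrete, the following is a minimal sketch of MaxSim together with one possible form of query-conditioned region routing. The abstract does not specify the pooling, the router, or the expert parameterization, so everything here is an assumption chosen for illustration: regions are mean-pooled, the router is linear over the concatenated region summary and query context vector, experts are plain linear maps, and routing is top-1. The names `maxsim`, `condition_regions`, `router_weights`, and `experts` are hypothetical and are not taken from the Argus implementation.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector, take its best-matching
    document vector, then sum those maxima over the query vectors."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def condition_regions(regions, context, router_weights, experts, top_k=1):
    """Hypothetical query-conditioned MoE over page regions (assumed design).

    regions:        list of regions, each a list of token vectors
    context:        query context vector
    router_weights: one weight vector per expert, applied to the
                    concatenation [region mean, context]
    experts:        one square matrix per expert (a linear map per token)
    Returns the flattened, query-dependent token vectors D(q).
    """
    out = []
    for tokens in regions:
        dim = len(tokens[0])
        # Assumed region summary: mean of the region's token vectors.
        mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
        # List concatenation: router input is [region mean, query context].
        gates = softmax([dot(w, mean + context) for w in router_weights])
        chosen = sorted(range(len(experts)), key=lambda e: gates[e], reverse=True)[:top_k]
        norm = sum(gates[e] for e in chosen)
        for t in tokens:
            mixed = [0.0] * dim
            for e in chosen:
                y = [dot(row, t) for row in experts[e]]
                mixed = [m + (gates[e] / norm) * yi for m, yi in zip(mixed, y)]
            out.append(mixed)
    return out


# Toy page: two regions in a 2-dimensional embedding space.
regions = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
experts = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 1.0]],  # expert 1: amplifies the first dimension
]
router_weights = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

query_vecs = [[1.0, 0.0]]
d_a = condition_regions(regions, [1.0, 0.0], router_weights, experts)
d_b = condition_regions(regions, [0.0, 1.0], router_weights, experts)
score_a = maxsim(query_vecs, d_a)
score_b = maxsim(query_vecs, d_b)
```

In this toy setup the same page yields different token vectors for the two context vectors, so `score_a` and `score_b` differ even though the retrieval-side query vectors are identical. That is the sense in which the document representation becomes $\mathbf{D}(q)$, while the final scoring step is still ordinary MaxSim over a multi-vector representation.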
