Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search

As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has drawn increasing attention, yet the field still lacks a deployable system that builds a lexical index over visual documents and serves queries at scale without neural query encoding. Existing methods fall into two camps: VLM-based dense or multi-vector models achieve strong retrieval quality but require neural query encoding at serving time, while OCR- or caption-based BM25 avoids query encoding at the cost of time-consuming text extraction or generation. To fill this missing serving regime, we present V-SPLADE, an inference-free sparse retriever for visual-document retrieval. Inference-free multimodal learned sparse retrieval remains underexplored, and such systems have not yet shown dense-level effectiveness under high sparsity. We attribute this limitation to a lexical grounding problem: visual sparse representations often fail to capture the lexical content embedded in document images. To address it, we introduce caption-gated token supervision, a training-only signal that uses VLM-generated captions as lexical cues to activate retrieval-relevant vocabulary dimensions. With this supervision, V-SPLADE improves average NDCG@5 across six visual-document retrieval benchmarks by +13.8pp over the same-scale dense baseline and by up to +6.3pp over OCR- or caption-based BM25 baselines. On an 18.7M-document corpus, it more than doubles R@5 over the same-scale dense baseline and, through score fusion, further improves competing retrievers by up to +2.4pp R@5. Code will be released soon at https://github.com/naver/v-splade.
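
To make the "inference-free" serving regime concrete, the sketch below shows how a query can be scored against precomputed sparse document vectors using only tokenization and an inverted-index lookup, with no neural model on the query side. It is a minimal illustration, not the V-SPLADE implementation: the function names, the whitespace tokenizer, the binary query-term weights, and the toy document weights are all assumptions made for this example. The actual system may use a different tokenizer and query-term weighting.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each vocabulary token to a posting list of (doc_id, weight).

    doc_vectors: {doc_id: {token: weight}}, i.e. sparse document vectors
    assumed to be produced offline by a document-side encoder.
    """
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for token, weight in vec.items():
            if weight > 0:  # only non-zero dimensions are indexed
                index[token].append((doc_id, weight))
    return index


def search(index, query, k=5):
    """Score documents for a query without any neural query encoding.

    Assumption: the query is a bag of unique lowercase whitespace tokens,
    each with weight 1, so a document's score is the sum of its weights
    on the query's tokens.
    """
    scores = defaultdict(float)
    for token in set(query.lower().split()):
        for doc_id, weight in index.get(token, ()):
            scores[doc_id] += weight
    # Highest score first; ties broken by doc_id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy sparse vectors (made-up weights, for illustration only).
doc_vectors = {
    "page_a": {"sparse": 1.5, "retrieval": 1.0, "index": 0.5},
    "page_b": {"dense": 1.5, "retrieval": 0.5},
    "page_c": {"invoice": 2.0, "total": 1.0},
}
index = build_inverted_index(doc_vectors)
results = search(index, "sparse retrieval")
print(results)  # [('page_a', 2.5), ('page_b', 0.5)]
```

Because all neural computation happens offline when documents are indexed, query latency in this setup depends only on tokenization and posting-list traversal, which is the serving property the abstract contrasts with dense and multi-vector retrievers.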
