Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation

From arXiv: 21 pages, 6 figures, 8 tables. Includes ancillary files with full benchmark results and ablation studies. Code available at https://github.com/athrael-soju/Snappy

Late-interaction multimodal retrieval models such as ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they operate at page-level granularity, which limits their utility for retrieval-augmented generation (RAG), where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding-box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms by using ColPali's patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on area efficiency. We evaluate on BBox-DocVQA, which provides ground-truth bounding boxes. For within-page localization (given correct page retrieval), ColQwen3-4B with percentile-50 thresholding achieves a 59.7% hit rate at the reference IoU threshold (84.4% at a looser threshold and 35.8% at a stricter one), with a mean IoU of 0.569, compared to ~6.7% for random region selection. The approach reduces context tokens by 28.8% compared to returning all OCR regions and by 52.3% compared to full-page image tokens, and it operates at inference time without additional training. We release Snappy, an open-source implementation, at https://github.com/athrael-soju/Snappy.
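
The patch-to-region idea summarized above can be illustrated with a minimal sketch. It is not the Snappy implementation: the function names (`region_scores`, `filter_regions`, `iou`), the area-weighted mean used as the aggregation rule, and the assumption that the patch grid tiles the page uniformly are all choices made here for illustration. The input `patch_scores` is assumed to already hold one query-relevance score per visual patch.

```python
import numpy as np


def region_scores(patch_scores, page_size, boxes):
    """Area-weighted mean patch score for each OCR box (illustrative rule).

    patch_scores: (rows, cols) array, one relevance score per visual patch.
    page_size:    (width, height) of the page, in the same units as the boxes.
    boxes:        iterable of (x0, y0, x1, y1) OCR bounding boxes.
    """
    rows, cols = patch_scores.shape
    page_w, page_h = page_size
    # Assumption: patches tile the page uniformly, with no padding or resizing.
    x_edges = np.arange(cols + 1) * (page_w / cols)
    y_edges = np.arange(rows + 1) * (page_h / rows)
    scores = []
    for x0, y0, x1, y1 in boxes:
        # 1-D overlap of the box with every patch column and every patch row.
        ox = np.clip(np.minimum(x1, x_edges[1:]) - np.maximum(x0, x_edges[:-1]), 0, None)
        oy = np.clip(np.minimum(y1, y_edges[1:]) - np.maximum(y0, y_edges[:-1]), 0, None)
        overlap = np.outer(oy, ox)  # (rows, cols) intersection areas
        total = overlap.sum()
        scores.append(float((overlap * patch_scores).sum() / total) if total > 0 else 0.0)
    return np.array(scores)


def filter_regions(scores, percentile=50.0):
    """Indices of regions whose score is at or above the given percentile."""
    threshold = np.percentile(scores, percentile)
    return np.flatnonzero(scores >= threshold)


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Toy example: a 2x2 patch grid over a 100x100 page and four OCR boxes.
toy_patches = np.array([[0.1, 0.9],
                        [0.2, 0.4]])
toy_boxes = [(0, 0, 50, 50), (50, 0, 100, 50), (25, 0, 75, 50), (0, 50, 100, 100)]
toy_scores = region_scores(toy_patches, (100, 100), toy_boxes)
toy_kept = filter_regions(toy_scores, percentile=50.0)
print(toy_scores, toy_kept)
```

In this toy example the second and third boxes are kept, since their scores sit above the median region score. A real pipeline would also have to account for any resizing or padding applied before the vision transformer, which this sketch ignores.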
