Visual RAG Toolkit: Scaling Multi-Vector Visual Retrieval with Training-Free Pooling and Multi-Stage Search

Multi-vector visual retrievers (e.g., ColPali-style late-interaction models) deliver strong accuracy but scale poorly: each page yields thousands of vectors, which makes indexing and search increasingly expensive. We present Visual RAG Toolkit, a practical system for scaling multi-vector visual retrieval with training-free, model-aware pooling and multi-stage retrieval. Motivated by Matryoshka Embeddings, our method applies static spatial pooling, including a lightweight sliding-window averaging variant, to patch embeddings, producing compact tile-level and global representations for fast candidate generation; the candidates are then reranked with exact MaxSim over the full multi-vector embeddings. This design yields a quadratic reduction in vector-to-vector comparisons by reducing the number of stored vectors per page from thousands to dozens, without requiring post-training, adapters, or distillation. In experiments with late-interaction models such as ColPali and ColSmol-500M on the limited ViDoRe v2 benchmark corpus, two-stage retrieval typically preserves NDCG and Recall at k = 5 and 10 with minimal degradation while substantially improving throughput (approximately 4x QPS); sensitivity appears mainly at very large k. The toolkit additionally provides robust preprocessing (high-resolution PDF-to-image conversion, optional cropping of margins and empty regions, and token hygiene, i.e., indexing only visual tokens) and a reproducible evaluation pipeline, enabling rapid exploration of two-stage, three-stage, and cascaded retrieval variants. By emphasizing efficiency at common cutoffs (e.g., k <= 10), the toolkit lowers hardware barriers and makes state-of-the-art visual retrieval more accessible in practice.
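
The pool-then-rerank idea can be illustrated with a minimal sketch. It assumes NumPy, non-overlapping mean pooling over a rectangular patch grid, and brute-force scoring in both stages. The function names (`pool_tiles`, `maxsim`, `two_stage_search`) and the parameters `tile` and `prefetch` are hypothetical and are not the toolkit's API; the toolkit's model-aware pooling and its sliding-window variant may differ in detail.

```python
import numpy as np


def pool_tiles(patches, grid_h, grid_w, tile):
    """Mean-pool a (grid_h * grid_w, dim) patch grid into tile x tile blocks.

    Assumption: patches are stored row-major over the grid. Rows and
    columns that do not fill a complete tile are dropped for simplicity.
    """
    dim = patches.shape[1]
    grid = patches.reshape(grid_h, grid_w, dim)
    th, tw = grid_h // tile, grid_w // tile
    grid = grid[: th * tile, : tw * tile]
    pooled = grid.reshape(th, tile, tw, tile, dim).mean(axis=(1, 3))
    return pooled.reshape(th * tw, dim)


def maxsim(query, page):
    """Late-interaction score: for each query vector, take the maximum
    dot product over the page vectors, then sum over query vectors."""
    return float((query @ page.T).max(axis=1).sum())


def two_stage_search(query, pooled_pages, full_pages, prefetch, k):
    """Stage 1 ranks all pages by MaxSim on pooled vectors; stage 2
    reranks the top `prefetch` candidates with exact MaxSim on the
    full multi-vector embeddings and returns the top-k page indices."""
    coarse = np.array([maxsim(query, p) for p in pooled_pages])
    candidates = np.argsort(-coarse)[:prefetch]
    exact = sorted(
        ((maxsim(query, full_pages[i]), int(i)) for i in candidates),
        reverse=True,
    )
    return [i for _, i in exact[:k]]
```

In this sketch, a page with a 32 x 32 patch grid and `tile=8` would be represented by 16 pooled vectors in the first stage, so the exact and more expensive MaxSim computation runs only on the `prefetch` candidates that survive it.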
