Visual RAG offers an alternative to traditional RAG: it treats documents as images and uses vision encoders to obtain patch tokens. However, storing and searching hundreds of patch tokens per document is costly in a vector database, so practical deployments aggregate them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark built from financial documents, where changing a single digit can cause a significant semantic shift. Our experiments show that single-vector aggregation collapses distinct documents into nearly identical vectors. Patch-level metrics do detect the semantic changes, which confirms that the aggregation step is what obscures these details. We identify global texture dominance as the root cause. Our findings hold across model scales, retrieval-optimized embeddings, and multiple mitigation strategies, highlighting significant risks for single-vector visual document retrieval in financial applications.
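The collapse described above can be illustrated with a toy sketch. It is not the paper's benchmark or method: the patch embeddings are synthetic random vectors, mean pooling is assumed as the aggregation step (the abstract does not name one), and the helper names `mean_pool`, `cosine`, and `compare_documents` are hypothetical. A component shared by every patch stands in, loosely, for the global page texture.

```python
import numpy as np


def mean_pool(patches):
    """Aggregate (num_patches, dim) patch embeddings into one unit vector."""
    v = patches.mean(axis=0)
    return v / np.linalg.norm(v)


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def compare_documents(num_patches=256, dim=64, seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: every patch shares a dominant "page texture" component,
    # with only small per-patch variation on top of it.
    shared = rng.normal(size=dim)
    doc_a = shared + 0.1 * rng.normal(size=(num_patches, dim))
    doc_b = doc_a.copy()
    # Simulate a single-digit edit: exactly one patch embedding changes.
    doc_b[17] = rng.normal(size=dim)

    # Document-level view: compare the two pooled vectors.
    pooled_sim = cosine(mean_pool(doc_a), mean_pool(doc_b))
    # Patch-level view: compare patches position by position.
    patch_sims = np.array([cosine(a, b) for a, b in zip(doc_a, doc_b)])
    return pooled_sim, float(patch_sims.min())


pooled_sim, min_patch_sim = compare_documents()
print(f"pooled cosine similarity: {pooled_sim:.4f}")
print(f"lowest per-patch cosine similarity: {min_patch_sim:.4f}")
```

In this toy setting the pooled vectors stay almost perfectly aligned because the edited patch contributes only 1/256 of the mean, while the per-patch comparison isolates the change. Real encoder outputs will behave differently in detail, so the numbers here only illustrate the mechanism.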