Although Multimodal Large Language Models (MLLMs) have shown remarkable potential in Visual Document Retrieval (VDR) through generating high-quality multi-vector embeddings, the substantial storage overhead of representing a single page with thousands of visual tokens limits their practicality in real-world applications. To address this challenge, we propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings. By incorporating an iterative margin loss during contrastive training, CausalEmbed encourages the embedding model to learn compact and well-structured representations. Our method enables efficient VDR using only dozens of visual tokens, achieving a 30–155× reduction in token count while maintaining highly competitive performance across diverse backbones and benchmarks. Theoretical analysis and empirical results demonstrate the unique advantages of auto-regressive embedding generation in terms of training efficiency and scalability at test time. As a result, CausalEmbed introduces a flexible test-time scaling strategy for multi-vector VDR representations and sheds light on the generative paradigm within multimodal document retrieval. Our code is available at https://github.com/Z1zs/Causal-Embed.
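To make the core training idea concrete, below is a minimal, hypothetical sketch of an iterative margin loss over auto-regressively generated embedding vectors: a hinge margin is applied at every prefix length, so each truncated multi-vector embedding remains discriminative on its own (which is what enables test-time scaling by emitting fewer vectors). The pooling choice, margin value, and similarity function here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def prefix_pooled(vectors, t):
    """Mean-pool the first t auto-regressively generated embedding vectors
    (pooling strategy is an illustrative assumption)."""
    return np.mean(np.asarray(vectors)[:t], axis=0)


def iterative_margin_loss(q_vecs, pos_vecs, neg_vecs_list, margin=0.2):
    """Hypothetical iterative margin loss: at every prefix length t,
    require the query-positive similarity to exceed each query-negative
    similarity by at least `margin` (hinge penalty otherwise)."""
    T = len(q_vecs)
    total = 0.0
    for t in range(1, T + 1):
        q = prefix_pooled(q_vecs, t)
        pos_sim = cosine(q, prefix_pooled(pos_vecs, t))
        for neg_vecs in neg_vecs_list:
            neg_sim = cosine(q, prefix_pooled(neg_vecs, t))
            total += max(0.0, margin - (pos_sim - neg_sim))
    # average over prefix lengths and negatives
    return total / (T * len(neg_vecs_list))
```

Because the margin constraint is enforced at every prefix, a retriever can stop after generating only the first few embedding vectors and still rank documents reasonably, trading accuracy for storage and compute.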