The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework explicitly designed to unify diverse document formats and domains, including Wikipedia articles, scientific papers (arXiv), and presentation slides, within a single retrieval scenario. We construct a large-scale cross-domain multi-modal benchmark comprising 450K samples that systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current state-of-the-art vision-language models (CLIP, BLIP-2, SigLIP-2, ALIGN) when applied to our task, with only CLIP demonstrating reasonable zero-shot performance. Furthermore, we conduct a systematic investigation of training strategies, including cross-modal fusion methods and loss functions, and develop a tailored approach to train CLIP on our benchmark. This yields a +31% improvement in MRR@10 over the zero-shot baseline. All data and code are released at https://github.com/J1mL1/DocMMIR.
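For readers unfamiliar with the reported metric, below is a minimal sketch of how MRR@10 can be computed from per-query retrieval ranks. The function name and input format are illustrative assumptions, not taken from the released code.

```python
from typing import Iterable, Optional

def mrr_at_10(ranks: Iterable[Optional[int]]) -> float:
    """Mean Reciprocal Rank with a cutoff of 10.

    `ranks` holds, for each query, the 1-based rank at which the relevant
    document was retrieved (None if it was not retrieved at all).
    Ranks beyond 10 contribute 0, per the @10 cutoff.
    """
    ranks = list(ranks)
    total = sum(1.0 / r for r in ranks if r is not None and r <= 10)
    return total / len(ranks) if ranks else 0.0

# Example: relevant documents found at ranks 1, 3, missing, and 42.
print(mrr_at_10([1, 3, None, 42]))  # (1 + 1/3 + 0 + 0) / 4 ≈ 0.333
```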