Information Retrieval (IR) methods, which aim to identify documents relevant to a given query, have gained remarkable attention due to their successful application in various natural language tasks. However, existing approaches typically consider only the textual information within documents, overlooking the fact that documents can contain multiple modalities, including text, images, and tables. Further, they often segment each long document into multiple discrete passages for embedding, which prevents them from capturing the overall document context and the interactions between paragraphs. We argue that these two limitations lead to suboptimal document representations for retrieval. In this work, we address them by producing more comprehensive and nuanced document representations that holistically embed documents interleaved with different modalities. Specifically, we leverage recent vision-language models, which can process and integrate text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we merge the representations of the segmented passages into a single document representation, and we additionally introduce a reranking strategy to decouple and identify the relevant passage within a document when necessary. Through extensive experiments on diverse information retrieval scenarios covering both textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to its unified treatment of the multimodal information interleaved within documents.
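The retrieve-then-rerank pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hashed bag-of-words `embed` function is a hypothetical stand-in for the vision-language encoder, and the names `doc_embedding`, `retrieve`, and `rerank` are introduced here for exposition. The key structural idea it shows is that passage embeddings are mean-pooled into a single holistic document representation for retrieval, with a separate passage-level reranking step applied only within the retrieved document.

```python
import numpy as np

def embed(text, dim=256):
    # Hypothetical stand-in for a vision-language encoder:
    # a hashed bag-of-words vector, normalized to unit length.
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def doc_embedding(passages):
    # Merge passage representations into one document representation
    # by mean-pooling, then renormalizing.
    m = np.mean([embed(p) for p in passages], axis=0)
    return m / np.linalg.norm(m)

def retrieve(query, docs):
    # Score each document holistically against the query (cosine
    # similarity of unit vectors = dot product); return the best index.
    q = embed(query)
    scores = [float(q @ doc_embedding(d)) for d in docs]
    return int(np.argmax(scores))

def rerank(query, passages):
    # Decouple the retrieved document: identify the single most
    # relevant passage inside it.
    q = embed(query)
    scores = [float(q @ embed(p)) for p in passages]
    return int(np.argmax(scores))

docs = [
    ["the cat sat on the mat", "dogs bark loudly at night"],
    ["stock prices rose sharply today", "markets closed higher this week"],
]
best_doc = retrieve("cat on a mat", docs)
best_passage = rerank("cat on a mat", docs[best_doc])
```

In a real system the toy encoder would be replaced by a multimodal model that maps interleaved text, images, and tables into the same embedding space, but the two-stage control flow (document-level retrieval over pooled representations, then passage-level reranking) is unchanged.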