Information Retrieval (IR) methods aim to identify documents relevant to a given query and have been widely applied across various natural language tasks. However, existing approaches typically consider only the textual content of documents, overlooking the fact that documents often contain multiple modalities, including images and tables. In addition, they usually segment each long document into multiple discrete passages for embedding, which prevents them from capturing the overall document context and the interactions between paragraphs. To address these two challenges, we propose a method that holistically embeds documents interleaved with multiple modalities, leveraging recent vision-language models that can process and integrate text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we merge the representations of segmented passages into a single document representation, and we additionally introduce a reranking strategy to decouple and identify the relevant passage within a document when necessary. Through extensive experiments on diverse IR scenarios covering both textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to its consideration of the multimodal information within documents.
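The two retrieval-side mechanisms described above, merging passage representations into a single document representation and then reranking passages within a retrieved document, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the merging here is assumed to be mean-pooling of normalized passage embeddings, and `embed_document` and `rerank_passages` are hypothetical helper names.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def embed_document(passage_embeddings):
    """Merge per-passage embeddings into one document embedding.

    Assumed merging step (hypothetical, for illustration): mean-pool
    the normalized passage vectors and re-normalize, so the document
    is indexed once instead of as discrete passages.
    """
    normed = [_normalize(p) for p in passage_embeddings]
    dim = len(normed[0])
    pooled = [sum(p[i] for p in normed) / len(normed) for i in range(dim)]
    return _normalize(pooled)


def rerank_passages(query_embedding, passage_embeddings):
    """After document-level retrieval, score each passage of the
    retrieved document against the query and return passage indices
    sorted by descending cosine similarity, so the relevant passage
    can be recovered when a passage-level answer is needed."""
    q = _normalize(query_embedding)
    scores = [
        sum(a * b for a, b in zip(q, _normalize(p)))
        for p in passage_embeddings
    ]
    return sorted(range(len(passage_embeddings)), key=lambda i: -scores[i])
```

In this sketch, retrieval compares the query against the single pooled document vector, and only the passages of the retrieved document are rescored, which keeps the index small while still allowing passage-level identification.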