Information Retrieval (IR) methods aim to identify documents relevant to a query and have been widely applied to various natural language tasks. However, existing approaches typically consider only the textual content of documents, overlooking the fact that documents can contain multiple modalities, including images and tables. In addition, they often segment each long document into multiple discrete passages for embedding, which prevents them from capturing the overall document context and the interactions between paragraphs. To address these two challenges, we propose a method that holistically embeds documents interleaved with multiple modalities, leveraging recent vision-language models that can process and integrate text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss caused by segmenting documents into passages, instead of representing and retrieving passages individually, we merge the representations of the segmented passages into a single document representation, and we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document when necessary. Through extensive experiments on diverse IR scenarios covering both textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to its consideration of the multimodal information within documents.
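The core retrieval idea above, merging per-passage embeddings into a single document representation and ranking documents against a query embedding, can be sketched as follows. This is only an illustrative toy in NumPy: the embedding dimensions, the mean-pooling fusion, and the cosine-similarity scoring are assumptions for the sketch, not the paper's actual model.

```python
import numpy as np

def embed_document(passage_embeddings):
    """Merge per-passage embeddings into one document embedding.

    Mean pooling is one simple merging strategy, used here purely
    as a placeholder for the paper's (unspecified) fusion method.
    """
    doc = np.mean(passage_embeddings, axis=0)
    return doc / np.linalg.norm(doc)  # L2-normalize for cosine retrieval

def retrieve(query_emb, doc_matrix, top_k=2):
    """Rank documents by cosine similarity to the query embedding."""
    query_emb = query_emb / np.linalg.norm(query_emb)
    scores = doc_matrix @ query_emb  # cosine scores (rows are unit vectors)
    return np.argsort(-scores)[:top_k]

# Toy example: 3 documents, each with 4 hypothetical passage embeddings.
rng = np.random.default_rng(0)
docs = [rng.normal(size=(4, 8)) for _ in range(3)]
doc_matrix = np.stack([embed_document(p) for p in docs])
query = rng.normal(size=8)
print(retrieve(query, doc_matrix))
```

A reranking stage as described in the abstract would then score the individual passages of each top-ranked document against the query to recover the relevant passage when passage-level output is needed.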