Document retrieval for tasks such as search and retrieval-augmented generation typically involves datasets that are unstructured: free-form text without explicit internal structure in each document. However, documents can have a structured form, consisting of fields such as an article title, message body, or HTML header. To address this gap, we introduce Multi-Field Adaptive Retrieval (MFAR), a flexible framework that accommodates any number and any type of document indices on structured data. Our framework consists of two main steps: (1) decomposing an existing document into fields, each indexed independently through dense and lexical methods, and (2) learning a model that adaptively predicts the importance of each field by conditioning on the query, allowing on-the-fly weighting of the most likely field(s). We find that our approach allows for the optimized use of dense versus lexical representations across field types, significantly improves document ranking over a number of existing retrievers, and achieves state-of-the-art performance for multi-field structured data.
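The two steps described above can be illustrated with a minimal sketch. This is not the authors' implementation: the term-count scorer stands in for a real lexical index such as BM25, the hashed character-trigram encoder stands in for a learned dense encoder, and the names `lexical_score`, `embed`, `field_weights`, `mfar_score`, and `params` are all hypothetical. In the actual framework the weighting parameters would be learned; here they are set by hand.

```python
import math
import zlib
from collections import Counter


def tokenize(text):
    return text.lower().split()


def lexical_score(query, text):
    """Toy lexical scorer (assumption: stand-in for BM25): query-term counts in the field."""
    counts = Counter(tokenize(text))
    return float(sum(counts[t] for t in tokenize(query)))


def embed(text, dim=64):
    """Toy dense encoder (assumption): hashed character trigrams, L2-normalised."""
    vec = [0.0] * dim
    s = text.lower()
    for i in range(len(s) - 2):
        vec[zlib.crc32(s[i:i + 3].encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def dense_score(query, text):
    """Cosine similarity between the toy embeddings of query and field text."""
    return sum(a * b for a, b in zip(embed(query), embed(text)))


SCORERS = {"lexical": lexical_score, "dense": dense_score}


def field_weights(query, params):
    """Query-conditioned softmax over (field, scorer) pairs.

    `params` maps (field, scorer) -> weight vector; these would be learned.
    """
    q = embed(query)
    logits = {k: sum(a * b for a, b in zip(q, w)) for k, w in params.items()}
    top = max(logits.values())
    exps = {k: math.exp(v - top) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def mfar_score(query, doc, params):
    """Weighted sum of per-field, per-scorer scores for one document."""
    weights = field_weights(query, params)
    return sum(
        w * SCORERS[scorer](query, doc.get(field, ""))
        for (field, scorer), w in weights.items()
    )


# Step 1: documents already decomposed into fields.
docs = [
    {"title": "adaptive retrieval", "body": "weighting fields by query"},
    {"title": "cooking pasta", "body": "boil water and add salt"},
]
# Step 2: hand-set parameters; all-zero vectors give uniform weights.
params = {(f, s): [0.0] * 64 for f in ("title", "body") for s in SCORERS}

query = "adaptive retrieval"
weights = field_weights(query, params)
ranked = sorted(docs, key=lambda d: mfar_score(query, d, params), reverse=True)
```

Because the weights are recomputed for every query, non-zero parameters would let one query lean on the lexical title index while another leans on the dense body index, which is the adaptive behaviour the abstract describes.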