Instead of matching a query against pre-existing passages, generative retrieval generates a passage's identifier string as the retrieval target. This places a requirement on the identifier: it must be distinctive enough to represent its passage. Current approaches use either a numeric ID or a text piece (such as a title or substrings) as the identifier. However, these identifiers cover a passage's content poorly. We are therefore motivated to propose a new type of identifier, the synthetic identifier, which is generated from the content of a passage and can integrate contextualized information that text pieces lack. Furthermore, we consider multiview identifiers simultaneously, including synthetic identifiers, titles, and substrings. These views complement each other and facilitate the holistic ranking of passages from multiple perspectives. We conduct a series of experiments on three public datasets, and the results indicate that our proposed approach performs the best among generative retrieval methods, demonstrating its effectiveness and robustness.
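To make the idea of ranking passages from multiple identifier views concrete, here is a minimal sketch. The abstract does not state how scores from the different views are combined, so the additive rule below is an assumption for illustration only, as are the function name `rank_passages`, the view names, the toy passages, and the scores.

```python
from collections import defaultdict


def rank_passages(predicted, passage_identifiers):
    """Rank passages by summing the scores of generated identifiers they own.

    predicted: dict mapping view name -> list of (identifier, score) pairs,
        standing in for identifiers a generative model produced for a query.
    passage_identifiers: dict mapping passage id -> dict of
        view name -> set of identifiers belonging to that passage.

    Assumption: scores are combined by simple addition across views.
    This is an illustrative choice, not the rule from the paper.
    """
    totals = defaultdict(float)
    for pid, views in passage_identifiers.items():
        for view, identifiers in views.items():
            for ident, score in predicted.get(view, []):
                if ident in identifiers:
                    totals[pid] += score
    # Highest total first; ties broken by passage id for determinism.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data: two passages, three identifier views each.
passages = {
    "p1": {
        "title": {"Eiffel Tower"},
        "substring": {"wrought-iron lattice tower", "Champ de Mars"},
        "synthetic": {"where is the eiffel tower located"},
    },
    "p2": {
        "title": {"Paris"},
        "substring": {"capital of France"},
        "synthetic": {"what is the capital of france"},
    },
}

# Hypothetical generated identifiers with made-up scores.
predicted = {
    "title": [("Eiffel Tower", 0.9), ("Paris", 0.4)],
    "substring": [("Champ de Mars", 0.5)],
    "synthetic": [
        ("where is the eiffel tower located", 0.8),
        ("what is the capital of france", 0.3),
    ],
}

ranking = rank_passages(predicted, passages)
for pid, total in ranking:
    print(pid, round(total, 2))
```

In this toy run, `p1` is matched in all three views and therefore ranks ahead of `p2`, which is matched only by its title and synthetic identifier. This shows how views can complement each other: a passage missed in one view can still be recovered through another.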