Generative retrieval (Wang et al., 2022; Tay et al., 2022) is a popular approach to end-to-end document retrieval that directly generates document identifiers given an input query. We introduce summarization-based document IDs, in which each document's ID is composed of an extractive summary or abstractive keyphrases generated by a language model, rather than an integer ID sequence or a bag of n-grams as proposed in past work. We find that abstractive, content-based IDs (ACID) and an ID based on the first 30 tokens are very effective in direct comparisons with previous approaches to ID creation. Using ACID improves top-10 and top-20 recall by 15.6% and 14.4% (relative), respectively, over the cluster-based integer ID baseline on the MSMARCO 100k retrieval task, and by 9.8% and 9.9%, respectively, on the Wikipedia-based NQ 100k retrieval task. These results demonstrate the effectiveness of human-readable, natural-language IDs created through summarization for generative retrieval. We also observe that extractive IDs outperform abstractive IDs on the Wikipedia articles in NQ but not on the snippets in MSMARCO, which suggests that document characteristics affect generative retrieval performance.
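To make the reported quantities concrete, the following is a minimal sketch of an ID built from a document's first 30 tokens, of top-k recall, and of relative improvement over a baseline. It is an illustration under stated assumptions, not the implementation behind the results above: whitespace tokenization stands in for whatever tokenizer is actually used, and the function names `first_k_tokens_id`, `recall_at_k`, and `relative_improvement` are hypothetical.

```python
from typing import Sequence


def first_k_tokens_id(document: str, k: int = 30) -> str:
    """Build a document ID from the first k tokens.

    Assumption: tokens are whitespace-separated words; a real system
    would more likely use the retrieval model's subword tokenizer.
    """
    return " ".join(document.split()[:k])


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    gold_ids: Sequence[str],
    k: int,
) -> float:
    """Fraction of queries whose gold document ID is among the top-k generated IDs."""
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have the same length")
    if not gold_ids:
        raise ValueError("at least one query is required")
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (new - baseline) / baseline


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    ranked = [["doc a", "doc b"], ["doc c", "doc d"]]
    gold = ["doc b", "doc e"]
    print(recall_at_k(ranked, gold, k=2))
    print(relative_improvement(new=0.6, baseline=0.5))
```

A relative figure such as those quoted above is computed as in `relative_improvement`: the difference between the two recall values divided by the baseline recall, not the absolute difference in recall points.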