Generative Retrieval via Term Set Generation

Generative retrieval has recently emerged as a promising alternative to traditional retrieval paradigms. It assigns each document a unique identifier, known as a DocID, and employs a generative model to directly generate the relevant DocID for an input query. A common choice of DocID is one or several natural language sequences, e.g., the title or n-grams, so that the pre-trained knowledge of the generative model can be utilized. However, a sequence is generated token by token, and at each decoding step only the most likely candidates are kept while the rest are pruned; retrieval therefore fails if any token of the relevant DocID is falsely pruned. Worse, during decoding the model perceives only the preceding tokens of the DocID and is blind to the subsequent ones, which makes it prone to such errors. To address this problem, we present a novel framework for generative retrieval, dubbed Term-Set Generation (TSGen). Instead of sequences, we use a set of terms as the DocID, automatically selected to concisely summarize the document's semantics and distinguish it from other documents. On top of the term-set DocID, we propose a permutation-invariant decoding algorithm, with which the term set can be generated in any permutation yet always leads to the corresponding document. Remarkably, at each decoding step TSGen perceives all valid terms rather than only the preceding ones. Given a constant decoding space, this broader perspective lets it make more reliable decisions. TSGen is also resilient to errors: the relevant DocID is not pruned as long as the decoded term belongs to it. Lastly, we design an iterative optimization procedure that incentivizes the model to generate the relevant term set in its favorable permutation. Extensive experiments on popular benchmarks validate the effectiveness, generalizability, scalability, and efficiency of TSGen.
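To make the permutation-invariant decoding idea concrete, the following is a minimal toy sketch in Python, not the authors' implementation. The corpus `DOC_TERMS`, the helper names `candidate_docs`, `valid_next_terms`, and `term_set_beam_search`, and the `overlap_score` function (a stand-in for the generative model's term likelihood) are all hypothetical and chosen for illustration only. The sketch assumes that, at each step, any not-yet-decoded term of any document still consistent with the decoded terms is a valid continuation.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

# Hypothetical toy corpus: each document's DocID is an unordered set of terms.
DOC_TERMS: Dict[str, FrozenSet[str]] = {
    "d1": frozenset({"generative", "retrieval", "docid"}),
    "d2": frozenset({"generative", "image", "diffusion"}),
    "d3": frozenset({"sparse", "retrieval", "bm25"}),
}

Prefix = Tuple[str, ...]
Scorer = Callable[[Prefix, str], float]


def candidate_docs(prefix: Prefix) -> List[str]:
    """Documents whose term set contains every decoded term, in any order."""
    decoded = set(prefix)
    return [doc for doc, terms in DOC_TERMS.items() if decoded <= terms]


def valid_next_terms(prefix: Prefix) -> Set[str]:
    """All remaining terms of the still-matching documents are valid next."""
    decoded = set(prefix)
    valid: Set[str] = set()
    for doc in candidate_docs(prefix):
        valid |= DOC_TERMS[doc] - decoded
    return valid


def term_set_beam_search(score: Scorer, beam_size: int = 2,
                         max_steps: int = 3) -> List[str]:
    """Beam search over term sets; returns matching documents, best first."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(max_steps):
        expanded: List[Tuple[Prefix, float]] = []
        for prefix, total in beams:
            next_terms = valid_next_terms(prefix)
            if not next_terms:  # term set complete: carry the beam forward
                expanded.append((prefix, total))
                continue
            for term in sorted(next_terms):
                expanded.append((prefix + (term,), total + score(prefix, term)))
        # Permutations of the same term set are equivalent: keep the best one.
        best: Dict[FrozenSet[str], Tuple[Prefix, float]] = {}
        for prefix, total in expanded:
            key = frozenset(prefix)
            if key not in best or total > best[key][1]:
                best[key] = (prefix, total)
        beams = sorted(best.values(), key=lambda b: -b[1])[:beam_size]
    ranked: List[str] = []
    for prefix, _ in beams:
        for doc in candidate_docs(prefix):
            if doc not in ranked:
                ranked.append(doc)
    return ranked


def overlap_score(prefix: Prefix, term: str) -> float:
    """Hypothetical scorer: 1.0 if the term appears in a fixed toy query."""
    return 1.0 if term in {"sparse", "retrieval", "bm25"} else 0.0


if __name__ == "__main__":
    print(term_set_beam_search(overlap_score))
    # The same term set in two different orders maps to the same document.
    print(candidate_docs(("bm25", "sparse", "retrieval")))
    print(candidate_docs(("retrieval", "bm25", "sparse")))
```

In this toy setting, decoding the shared term `retrieval` first keeps both `d1` and `d3` alive, which illustrates the resilience property described above: a document is pruned only when a decoded term falls outside its term set. A sequence DocID, by contrast, would be lost as soon as any single token in its fixed order was pruned.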
