Generative retrieval (GR) reframes document retrieval as an end-to-end task of generating sequential document identifiers (DocIDs). Existing GR methods predominantly rely on left-to-right auto-regressive decoding, which suffers from two fundamental limitations: (i) a \emph{mismatch between DocID generation and natural language generation}, whereby an incorrect DocID token generated at an early step can lead to entirely erroneous retrieval; and (ii) an \emph{inability to dynamically balance the trade-off between retrieval efficiency and accuracy}, which is crucial for practical applications. To tackle these challenges, we propose generative document retrieval with diffusion language models, termed \emph{DiffuGR}. DiffuGR formulates DocID generation as a discrete diffusion process. During training, DocIDs are corrupted through a stochastic masking process, and a diffusion language model is trained to recover them under a retrieval-aware objective. For inference, DiffuGR generates DocID tokens in parallel and refines them through a controllable number of denoising steps. Unlike auto-regressive decoding, DiffuGR introduces \emph{a novel mechanism that first generates high-confidence DocID tokens and then refines the remainder through diffusion-based denoising}. Moreover, DiffuGR offers \emph{explicit runtime control over the quality-latency trade-off}. Extensive experiments on widely used retrieval benchmarks show that DiffuGR outperforms strong auto-regressive generative retrievers. Additionally, we verify that DiffuGR achieves flexible control over the quality-latency trade-off via variable denoising budgets.
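The mask-and-denoise idea sketched above can be illustrated with a minimal toy implementation. This is not DiffuGR's actual model: `MASK`, `corrupt`, `denoise`, and the oracle predictor below are hypothetical names, and a stub predictor stands in for the trained diffusion language model. The sketch shows only the two mechanisms the abstract describes: stochastic masking of DocID tokens during training, and iterative inference where the most confident predictions are committed first and the number of denoising steps is a runtime knob.

```python
import math
import random

MASK = -1  # placeholder id for a masked DocID token (illustrative)

def corrupt(docid, mask_prob, rng):
    # Training-side corruption: independently mask each DocID token
    # with probability mask_prob (a stochastic masking process).
    return [MASK if rng.random() < mask_prob else t for t in docid]

def denoise(masked, predict, steps):
    # Inference-side refinement: at each denoising step, score every
    # masked slot in parallel, commit only the most confident
    # predictions, and leave the rest masked for later steps.
    # More steps trade latency for accuracy (the quality-latency knob).
    tokens = list(masked)
    n_masked = sum(t == MASK for t in tokens)
    per_step = max(1, math.ceil(n_masked / steps))
    for _ in range(steps):
        holes = [i for i, t in enumerate(tokens) if t == MASK]
        if not holes:
            break
        # predict(tokens, i) -> (token, confidence) for masked slot i.
        scored = [(predict(tokens, i), i) for i in holes]
        scored.sort(key=lambda x: -x[0][1])  # highest confidence first
        for (tok, _conf), i in scored[:per_step]:
            tokens[i] = tok
    return tokens

# Toy usage: an oracle predictor that knows the target DocID,
# with random confidences, standing in for the trained model.
rng = random.Random(0)
target = [3, 1, 4, 1, 5, 9]
noisy = corrupt(target, mask_prob=0.5, rng=rng)
oracle = lambda toks, i: (target[i], rng.random())
recovered = denoise(noisy, oracle, steps=3)
print(recovered)  # all masked slots filled within the step budget
```

Because `per_step` is derived from the step budget, every masked slot is guaranteed to be filled within `steps` iterations; with a real model, fewer steps commit more low-confidence guesses per iteration, which is the controllable quality-latency trade-off the abstract refers to.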