Generative retrieval (GR) reframes document retrieval as an end-to-end task of generating sequential document identifiers (DocIDs). Existing GR methods predominantly rely on left-to-right auto-regressive decoding, which suffers from two fundamental limitations: (i) a \emph{mismatch between DocID generation and natural language generation}, whereby an incorrect DocID token generated at an early step can render the entire retrieval erroneous; and (ii) an \emph{inability to dynamically balance the trade-off between retrieval efficiency and accuracy}, which is crucial for practical applications. To tackle these challenges, we propose generative document retrieval with diffusion language models, termed \emph{DiffuGR}. DiffuGR formulates DocID generation as a discrete diffusion process. During training, DocIDs are corrupted through a stochastic masking process, and a diffusion language model is trained to recover them under a retrieval-aware objective. At inference time, DiffuGR generates DocID tokens in parallel and refines them through a controllable number of denoising steps. Unlike auto-regressive decoding, DiffuGR introduces \emph{a novel mechanism that first generates high-confidence DocID tokens and then refines the generation through diffusion-based denoising}. Moreover, DiffuGR offers \emph{explicit runtime control over the quality-latency trade-off}. Extensive experiments on widely used retrieval benchmarks show that DiffuGR outperforms strong auto-regressive generative retrievers. Additionally, we verify that DiffuGR achieves flexible control over the quality-latency trade-off via variable denoising budgets.
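The confidence-first denoising mechanism described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_model` is a hypothetical stand-in for the diffusion language model (it returns a predicted token and a confidence score per position), and `diffusion_decode` shows how, at each denoising step, the most confident masked positions are committed first, with the step budget `num_steps` acting as the runtime quality-latency knob.

```python
MASK = -1  # sentinel for a still-masked DocID position

def toy_model(query, docid):
    # Hypothetical model stand-in: predicts (token, confidence) per position.
    # Confidence decays with position to mimic varying model certainty.
    target = [query + i for i in range(len(docid))]
    return [(target[i], 1.0 / (1 + i)) for i in range(len(docid))]

def diffusion_decode(query, length, num_steps):
    """Parallel DocID generation with confidence-first iterative denoising."""
    docid = [MASK] * length
    per_step = max(1, length // num_steps)  # tokens committed per denoising step
    for _ in range(num_steps):
        masked = [i for i in range(length) if docid[i] == MASK]
        if not masked:
            break
        preds = toy_model(query, docid)
        # Commit the most confident masked positions first; the rest stay
        # masked and are refined in later denoising steps.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            docid[i] = preds[i][0]
    # Fill any positions left over after the step budget is exhausted.
    preds = toy_model(query, docid)
    for i in range(length):
        if docid[i] == MASK:
            docid[i] = preds[i][0]
    return docid

# A larger num_steps re-queries the model more often (higher quality,
# higher latency); num_steps=1 commits everything in one parallel pass.
print(diffusion_decode(10, 4, 2))  # → [10, 11, 12, 13]
```

Note that, unlike auto-regressive decoding, no position depends on a fixed left-to-right order here: commitment order is driven entirely by model confidence, which is what lets an early low-confidence token be deferred rather than locked in.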