We apply an information-theoretic perspective to reconsider generative document retrieval (GDR), in which a document $x \in X$ is indexed by $t \in T$, and a neural autoregressive model is trained to map queries $Q$ to $T$. GDR can thus be viewed as information transmission from documents $X$ to queries $Q$, with the requirement to transmit more bits via the indexes $T$. By applying Shannon's rate-distortion theory, the optimality of the indexing can be analyzed in terms of mutual information, and the design of the indexes $T$ can then be regarded as a {\em bottleneck} in GDR. After reformulating GDR from this perspective, we empirically quantify the bottleneck underlying GDR. Finally, on the NQ320K and MARCO datasets, we evaluate our proposed bottleneck-minimal indexing method against various previous indexing methods and show that it outperforms them.