This paper introduces Web-DRO, an unsupervised dense retrieval model that clusters documents based on web structures and reweights the resulting groups during contrastive training. Specifically, we first leverage web graph links to contrastively train an embedding model, which is then used to cluster anchor-document pairs. We then apply Group Distributionally Robust Optimization (GroupDRO) to reweight the clusters of anchor-document pairs, which guides the model to assign more weight to groups with higher contrastive loss and to pay more attention to the worst case during training. Our experiments on MS MARCO and BEIR show that Web-DRO significantly improves retrieval effectiveness in unsupervised settings. A comparison of clustering techniques shows that training on the web graph combined with URL information yields the best clustering performance. Further analysis confirms that the group weights are stable and valid, indicating consistent model preferences as well as effective up-weighting of valuable groups and down-weighting of uninformative ones. The code is available at https://github.com/OpenMatch/Web-DRO.
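The group reweighting step described above can be illustrated with a minimal sketch. The abstract does not state the exact update rule, so this assumes the exponentiated-gradient update commonly used in the GroupDRO literature; the function names, the step size, and the toy per-group losses are all hypothetical and are not taken from the Web-DRO codebase.

```python
import math


def update_group_weights(weights, group_losses, step_size=0.1):
    """One exponentiated-gradient step (assumed update rule).

    Each group's weight is multiplied by exp(step_size * loss) and the
    result is renormalized, so groups with higher loss gain weight.
    """
    scaled = [w * math.exp(step_size * loss)
              for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def robust_loss(weights, group_losses):
    """Weighted sum of per-group losses used as the training objective."""
    return sum(w * loss for w, loss in zip(weights, group_losses))


# Toy example: three clusters with made-up average contrastive losses.
weights = [1 / 3, 1 / 3, 1 / 3]
group_losses = [0.5, 2.0, 1.0]

new_weights = update_group_weights(weights, group_losses)
print(new_weights)
print(robust_loss(new_weights, group_losses))
```

In this sketch the second cluster, which has the highest loss, ends up with the largest weight, so the weighted objective shifts toward the worst-performing group. In actual training the per-group losses would be recomputed from each batch and the weights updated alongside the model parameters.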