Hard negatives play a critical role in training and fine-tuning dense retrieval models: they are semantically similar to positive documents yet non-relevant, and distinguishing them correctly is essential for improving retrieval accuracy. However, identifying effective hard negatives typically requires extensive ablation studies that repeat fine-tuning with different negative sampling strategies and hyperparameters, at substantial computational cost. In this paper, we introduce ECI (Effective Contrastive Information), a metric grounded in information theory and information retrieval principles that enables practitioners to assess the quality of hard negatives prior to model fine-tuning. ECI evaluates negatives by optimizing the trade-off between Information Capacity, the logarithmic bound on mutual information determined by set size, and Discriminative Efficiency, a harmonic balance of Signal Magnitude (Hardness) and Safety (Max-Margin). Unlike heuristic approaches, ECI strictly penalizes the unsafe, false-positive negatives that are prevalent in generative methods. We evaluate ECI across hard-negative sets mined or generated using BM25, cross-encoders, and large language models. Our results demonstrate that ECI accurately predicts downstream retrieval performance and identifies hybrid strategies (BM25 + cross-encoder) as offering the best balance of volume and reliability, significantly reducing the need for costly end-to-end ablation studies.
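The abstract does not state the ECI formula, so the following is only an illustrative sketch of how a score with this structure could be computed, not the authors' definition. Everything specific here is an assumption: the function name `eci_sketch`, similarity scores scaled to [0, 1], capacity taken as log(1 + K) for K negatives (in the spirit of the InfoNCE bound), hardness taken as the mean negative score, safety taken as the margin between the positive and the hardest negative, and capacity and efficiency combined by a simple product.

```python
import math
from typing import Sequence


def eci_sketch(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Hypothetical ECI-style score for one query (not the paper's formula).

    pos_score  -- similarity of the query to its positive document, in [0, 1]
    neg_scores -- similarities of the query to each candidate negative, in [0, 1]
    """
    if not neg_scores:
        return 0.0

    # Assumed Information Capacity: logarithmic in the set size (K negatives + 1 positive).
    capacity = math.log(1 + len(neg_scores))

    # Assumed Hardness: mean similarity of the negatives to the query.
    hardness = min(max(sum(neg_scores) / len(neg_scores), 0.0), 1.0)

    # Assumed Safety: margin between the positive and the hardest negative.
    # A negative scoring at or above the positive drives safety to zero.
    safety = min(max(pos_score - max(neg_scores), 0.0), 1.0)

    # Discriminative Efficiency as a harmonic mean: zero if either term is zero.
    if hardness + safety == 0.0:
        efficiency = 0.0
    else:
        efficiency = 2.0 * hardness * safety / (hardness + safety)

    # Assumed combination: product of capacity and efficiency.
    return capacity * efficiency


if __name__ == "__main__":
    # A set with a clear margin between the positive and all negatives.
    print(eci_sketch(0.9, [0.5, 0.6, 0.7]))
    # A set containing a likely mislabeled negative that outscores the positive.
    print(eci_sketch(0.9, [0.95, 0.5]))
```

Under these assumptions, the harmonic mean is what makes the penalty strict: a single negative that outscores the positive sets safety to zero, which zeroes the whole score regardless of how large or how hard the rest of the set is.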