Domain-specific finetuning is essential for dense retrievers, yet not all training pairs contribute equally to the learning process. We introduce OPERA, a data pruning framework that exploits this heterogeneity to improve both the effectiveness and efficiency of retrieval model adaptation. We first investigate static pruning (SP), which retains only high-similarity query-document pairs, and uncover an intrinsic quality-coverage tradeoff: ranking quality (NDCG) improves while retrieval coverage (Recall) can degrade due to reduced query diversity. To resolve this tradeoff, we propose a two-stage dynamic pruning (DP) strategy that adaptively modulates sampling probabilities at both the query and document levels throughout training, prioritizing high-quality examples while keeping the full training set accessible. Evaluations across eight datasets spanning six domains demonstrate the effectiveness of both approaches: SP improves ranking over standard finetuning (NDCG@10 +0.5\%), while DP achieves the strongest performance on both ranking (NDCG@10 +1.9\%) and retrieval (Recall@20 +0.7\%), attaining the best average rank (1.38) among all compared methods. These findings carry over to Qwen3-Embedding, an LLM-based dense retriever, confirming that the benefits are architecture-agnostic. Notably, DP reaches performance comparable to standard finetuning in less than 50\% of its training time.
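For concreteness, one simple way the two regimes could be instantiated (an illustrative sketch, not necessarily OPERA's exact formulation) is a hard similarity threshold for SP and a temperature-controlled softmax over similarity scores for DP's sampling distribution:
\[
\mathcal{D}_{\mathrm{SP}} = \bigl\{(q, d^{+}) \in \mathcal{D} \,\bigm|\, s(q, d^{+}) \ge \theta \bigr\},
\qquad
p_{\mathrm{DP}}(q, d^{+}) = \frac{\exp\bigl(s(q, d^{+})/\tau\bigr)}{\sum_{(q', d'^{+}) \in \mathcal{D}} \exp\bigl(s(q', d'^{+})/\tau\bigr)},
\]
where $s(q, d^{+})$ denotes a query-document similarity score (e.g., from the pre-finetuned retriever), $\theta$ a retention threshold, and $\tau$ a temperature. Under this sketch, SP discards low-similarity pairs outright, whereas DP keeps every pair reachable but samples high-quality pairs more often; the threshold and softmax forms, and the restriction to the query level, are assumptions for illustration only.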