Linear combination is an effective data fusion method for information retrieval because its weights can be tuned to different scenarios. Training good weights, however, has traditionally required manual relevance judgments on a large percentage of documents, which is labor-intensive and expensive. In this study, we investigate whether near-optimal weights can be obtained using only 20\%-50\% of the relevant documents. In experiments on four TREC datasets, we find that weights trained by multiple linear regression on this reduced set closely rival those trained with TREC's official "qrels". These findings suggest that data fusion can be made more efficient and affordable, allowing researchers and practitioners to benefit from it with significantly less judging effort.
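To make the idea concrete, the following is a minimal sketch of linear combination fusion with weights fitted by multiple linear regression, first on full judgments and then on a subsample of the relevant documents. It is an illustration under stated assumptions, not the procedure used in the study: the function names (`train_weights`, `fuse`, `subsample_relevant`) are hypothetical, scores are assumed to be already normalized to a common range, the data are synthetic, and documents left unjudged after subsampling are treated as non-relevant.

```python
import numpy as np


def train_weights(scores, relevance):
    """Fit linear-combination weights by ordinary least squares.

    scores:    (n_docs, n_systems) array, one column of scores per system.
    relevance: (n_docs,) array of judgments used as the regression target.
    Returns (intercept, weights) with weights of shape (n_systems,).
    """
    # Prepend a column of ones so the regression includes an intercept.
    X = np.column_stack([np.ones(len(scores)), scores])
    coef, *_ = np.linalg.lstsq(X, relevance, rcond=None)
    return coef[0], coef[1:]


def fuse(scores, weights):
    """Linear combination: fused score is the weighted sum of system scores."""
    return scores @ weights


def subsample_relevant(relevance, fraction, rng):
    """Keep a random fraction of the relevant documents as judged relevant.

    Assumption: every other document is treated as non-relevant (label 0).
    """
    rel_idx = np.flatnonzero(relevance == 1)
    n_keep = max(1, int(round(fraction * len(rel_idx))))
    keep = rng.choice(rel_idx, size=n_keep, replace=False)
    partial = np.zeros_like(relevance)
    partial[keep] = 1
    return partial


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_docs, n_systems = 2000, 3

    # Synthetic ground truth: roughly 10% of documents are relevant.
    relevance = (rng.random(n_docs) < 0.1).astype(float)

    # Synthetic systems of differing quality: more noise means a weaker system.
    noise_levels = np.array([0.3, 0.6, 1.0])
    scores = relevance[:, None] + rng.normal(size=(n_docs, n_systems)) * noise_levels

    _, w_full = train_weights(scores, relevance)
    partial = subsample_relevant(relevance, fraction=0.3, rng=rng)
    _, w_partial = train_weights(scores, partial)

    # Ranking depends only on the relative sizes of the weights,
    # so compare them after normalizing to unit sum.
    print("full judgments:   ", w_full / w_full.sum())
    print("30% of relevant:  ", w_partial / w_partial.sum())
    print("top fused doc ids:", np.argsort(-fuse(scores, w_partial))[:5])
```

The comparison normalizes both weight vectors because a uniform rescaling of the weights does not change the fused ranking; what matters for fusion is how the weights compare across systems.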