Labeled datasets are essential for modern search engines, which increasingly rely on supervised methods such as Learning to Rank and on large volumes of data to train deep learning models. However, creating these datasets is time-consuming and costly, so user click and activity logs are commonly used as proxies for relevance. In this paper, we present a weak supervision approach that infers the quality of query-document pairs, and we apply it within a Learning to Rank framework to improve the precision of a large-scale search system.