Semantic Embedding Models (SEMs) have become a core component in information retrieval and natural language processing due to their ability to model semantic relevance. However, despite their growing application in search engines, few studies have systematically explored how to construct effective training data for SEMs from large-scale search engine query logs. In this paper, we present a comprehensive analysis of strategies for generating pairwise judgments as SEM training data. An interesting (perhaps surprising) discovery is that the conventional pairwise-formulation approaches used in Learning-to-Rank (LTR) are not necessarily optimal for SEM training. Through a large-scale empirical study using query logs and click-through data from a major search engine, we identify effective strategies and demonstrate the advantages of a proposed hybrid heuristic over simpler atomic heuristics. Finally, we provide best practices for SEM training and outline directions for future research.