In information retrieval, identifying the facets of a user query is an important task. If a search service can recognize the facets of a user's query, it can offer users a much broader range of search results. Previous studies have improved facet prediction by leveraging retrieved documents and related queries obtained through a search engine. However, when a search engine operates as part of the model, extending the approach to other applications is challenging. First, search engines are constantly updated, so the additional information they return may differ between training and testing, which may reduce performance. Second, public search engines cannot search internal documents, so a separate search system must be built to incorporate documents from a company's private domains. We propose two strategies for a framework that predicts facets from the query alone, without a search engine. The first strategy is multi-task learning to predict the search engine result page (SERP). By using the SERP as a target instead of a source, the proposed model learns a deeper understanding of queries without relying on external modules. The second strategy is to enhance the facets by combining a large language model (LLM) with a small model. Overall performance is better when the small model and the LLM are combined than when either generates facets on its own.
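The two strategies can be made concrete with a small sketch. It is illustrative only and is not the authors' implementation: the task prefixes `facets:` and `serp:`, the names `Seq2SeqExample`, `make_multitask_examples`, and `merge_facets`, and the simple union rule used to combine the two models' outputs are all assumptions introduced here. The point it shows is that SERP text appears only on the target side of a training example, so inference needs nothing but the query.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Seq2SeqExample:
    """One (input text, target text) pair for a text-to-text model."""
    source: str
    target: str


def make_multitask_examples(
    query: str, facets: List[str], serp_snippets: List[str]
) -> List[Seq2SeqExample]:
    """Build two training examples from a single query.

    Both examples take only the query as input. The SERP text is used as a
    target (auxiliary task), never as part of the input.
    The task prefixes are hypothetical, chosen for illustration.
    """
    facet_example = Seq2SeqExample(
        source=f"facets: {query}", target=", ".join(facets)
    )
    serp_example = Seq2SeqExample(
        source=f"serp: {query}", target=" | ".join(serp_snippets)
    )
    return [facet_example, serp_example]


def merge_facets(
    small_model_facets: List[str], llm_facets: List[str], limit: int = 5
) -> List[str]:
    """Combine facets from both models (assumed rule: order-preserving,
    case-insensitive union, small-model facets first, truncated to `limit`)."""
    seen = set()
    merged = []
    for facet in small_model_facets + llm_facets:
        cleaned = facet.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return merged[:limit]


if __name__ == "__main__":
    examples = make_multitask_examples(
        "jaguar",
        facets=["car", "animal"],
        serp_snippets=["Jaguar cars official site", "Jaguar facts"],
    )
    for example in examples:
        print(example.source, "->", example.target)
    print(merge_facets(["car", "animal"], ["Animal", "sports team"]))
```

At inference time only the `facets:` input would be used, so no search engine call is required. Other ways of combining the two models, such as asking the LLM to revise the small model's facets, would fit the same description.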