Accurate inference of user intent is crucial for enhancing document retrieval in modern search engines. While large language models (LLMs) have made significant strides in this area, their effectiveness has predominantly been assessed with short, keyword-based queries. As AI-driven search evolves, long-form queries with intricate intents are becoming more prevalent, yet they remain underexplored in the context of LLM-based query understanding (QU). To bridge this gap, we introduce ReDI: a Reasoning-enhanced approach for query understanding through Decomposition and Interpretation. ReDI leverages the reasoning and comprehension capabilities of LLMs in a three-stage pipeline: (i) it breaks down complex queries into targeted sub-queries to accurately capture user intent; (ii) it enriches each sub-query with detailed semantic interpretations to improve query-document matching; and (iii) it independently retrieves documents for each sub-query and employs a fusion strategy to aggregate the results into the final ranking. We compiled a large-scale dataset of real-world complex queries from a major search engine and distilled the query understanding capabilities of teacher models into smaller models for practical application. Experiments on BRIGHT and BEIR demonstrate that ReDI consistently surpasses strong baselines in both sparse and dense retrieval paradigms. We release our code at https://anonymous.4open.science/r/ReDI-6FC7/.
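To make the three-stage pipeline concrete, the following is a minimal sketch of its control flow, not the released ReDI implementation. The abstract does not specify the fusion strategy, so reciprocal rank fusion (RRF) is swapped in here purely as an illustrative assumption. The function names and the `decompose`, `interpret`, and `retrieve` callables are hypothetical placeholders for the LLM-based decomposition, the LLM-based interpretation, and a sparse or dense retriever.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    Assumption: RRF stands in for the unspecified fusion strategy.
    Each document scores the sum of 1 / (k + rank) over the lists it appears in.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def decompose_interpret_retrieve(
    query: str,
    decompose: Callable[[str], List[str]],      # hypothetical: LLM-based decomposition
    interpret: Callable[[str], str],            # hypothetical: LLM-based interpretation
    retrieve: Callable[[str, int], List[str]],  # hypothetical: sparse or dense retriever
    top_k: int = 10,
) -> List[str]:
    """Run the three stages in order and return one fused ranking."""
    # Stage (i): break the complex query into targeted sub-queries.
    sub_queries = decompose(query)
    # Stage (ii): enrich each sub-query with a semantic interpretation.
    enriched = [interpret(sub_query) for sub_query in sub_queries]
    # Stage (iii): retrieve independently per sub-query, then fuse the rankings.
    rankings = [retrieve(text, top_k) for text in enriched]
    return reciprocal_rank_fusion(rankings)
```

With toy stand-ins for the three callables, a document returned for several sub-queries rises to the top of the fused list, which is the behavior the independent-retrieval-then-fusion design relies on. The constant `k = 60` is a common default for RRF and is not taken from the paper.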