Retrieval systems often fail when user queries differ stylistically or semantically from the language used in domain documents. Query rewriting has been proposed to bridge this gap, improving retrieval by reformulating user queries into semantically equivalent forms. However, most existing approaches overlook the stylistic characteristics of the target documents (their domain-specific phrasing, tone, and structure), which are crucial for matching real-world data distributions. We introduce a retrieval feedback-driven dataset generation framework that automatically identifies failed retrieval cases, leverages large language models to rewrite queries in the style of the relevant documents, and verifies improvement through re-retrieval. The resulting corpus of (original, rewritten) query pairs enables the training of rewriter models that are explicitly aware of document style and retrieval feedback. This work highlights a new direction in data-centric information retrieval, emphasizing how feedback loops and document-style alignment can enhance the reasoning and adaptability of RAG systems in real-world, domain-specific contexts.
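The feedback loop described above (detect a failed retrieval, rewrite the query in the target document's style, keep the pair only if re-retrieval succeeds) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token-overlap retriever, the `rewrite_in_document_style` stub, and all function names are assumptions standing in for a real retriever and an LLM rewriter.

```python
def retrieve(query, corpus, k=1):
    """Toy retriever: rank documents by token overlap with the query.
    A real system would use BM25 or a dense embedding retriever."""
    def score(doc):
        q, d = set(query.lower().split()), set(doc.lower().split())
        return len(q & d) / (len(q) or 1)
    return sorted(corpus, key=score, reverse=True)[:k]

def hit(query, corpus, gold):
    """Retrieval succeeds if the gold document appears in the top-k results."""
    return gold in retrieve(query, corpus)

def rewrite_in_document_style(query, style_hint):
    """Stand-in for the LLM rewriter: append domain phrasing to the query.
    A real system would prompt an LLM with the target document's style."""
    return f"{query} {style_hint}"

def build_pairs(queries, corpus, style_hint):
    """Collect (original, rewritten) pairs, verified by re-retrieval."""
    pairs = []
    for query, gold in queries:
        if hit(query, corpus, gold):      # retrieval already succeeds; skip
            continue
        rewritten = rewrite_in_document_style(query, style_hint)
        if hit(rewritten, corpus, gold):  # keep only verified improvements
            pairs.append((query, rewritten))
    return pairs
```

For example, a colloquial query like "chest pain remedies" may fail against a clinical corpus, while a rewrite that injects domain phrasing ("myocardial infarction") retrieves the gold document, so the pair is added to the training corpus.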