Retrieval systems often fail when user queries differ stylistically or semantically from the language used in domain documents. Query rewriting has been proposed to bridge this gap, improving retrieval by reformulating user queries into semantically equivalent forms. However, most existing approaches overlook the stylistic characteristics of the target documents (their domain-specific phrasing, tone, and structure), which are crucial for matching real-world data distributions. We introduce a retrieval feedback-driven dataset generation framework that automatically identifies failed retrieval cases, leverages large language models to rewrite queries in the style of the relevant documents, and verifies improvement through re-retrieval. The resulting corpus of (original, rewritten) query pairs enables the training of rewriter models that are explicitly aware of document style and retrieval feedback. This work highlights a new direction in data-centric information retrieval, emphasizing how feedback loops and document-style alignment can enhance the reasoning and adaptability of RAG systems in real-world, domain-specific contexts.
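The feedback loop described above (detect a failed retrieval, rewrite the query in the target document's style, keep the pair only if re-retrieval succeeds) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token-overlap retriever, the `rewrite_in_document_style` stub, and all function names are assumptions standing in for a real retriever and an LLM rewriter.

```python
def retrieve(query, corpus, k=1):
    """Toy retriever: rank documents by token overlap with the query.
    A real system would use BM25 or a dense embedding retriever."""
    def score(doc):
        q, d = set(query.lower().split()), set(doc.lower().split())
        return len(q & d) / (len(q) or 1)
    return sorted(corpus, key=score, reverse=True)[:k]

def hit(query, corpus, gold):
    """Retrieval succeeds if the gold document appears in the top-k results."""
    return gold in retrieve(query, corpus)

def rewrite_in_document_style(query, style_hint):
    """Stand-in for the LLM rewriter: append domain phrasing to the query.
    A real system would prompt an LLM with the target document's style."""
    return f"{query} {style_hint}"

def build_pairs(queries, corpus, style_hint):
    """Collect (original, rewritten) pairs, verified by re-retrieval."""
    pairs = []
    for query, gold in queries:
        if hit(query, corpus, gold):      # retrieval already succeeds; skip
            continue
        rewritten = rewrite_in_document_style(query, style_hint)
        if hit(rewritten, corpus, gold):  # keep only verified improvements
            pairs.append((query, rewritten))
    return pairs
```

For example, a colloquial query like "chest pain remedies" may fail against a clinical corpus, while a rewrite that injects domain phrasing ("myocardial infarction") retrieves the gold document, so the pair is added to the training corpus.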