In a real-world RAG system, the current query often contains colloquial ellipses and ambiguous references drawn from the dialogue context, necessitating query rewriting to better describe the user's information needs. However, traditional context-based rewriting yields minimal improvement in downstream generation because of the long pipeline from query rewriting to response generation. Some researchers try to leverage reinforcement learning with generation feedback to assist the rewriter, but these sparse rewards provide little guidance in most cases, leading to unstable training and generation results. We find that the user's needs are also reflected in the gold document, the retrieved documents, and the ground truth. Therefore, feeding these multi-aspect dense rewards back to query rewriting yields more stable and satisfactory responses. In this paper, we propose a novel query rewriting method, MaFeRw, which improves RAG performance by integrating multi-aspect feedback from both the retrieval process and the generated results. Specifically, we first use manual data to train a T5 model to initialize the rewriter. Next, we design three metrics as reinforcement learning feedback: the similarity between the rewritten query and the gold document, the ranking metric of the retrieved documents, and ROUGE between the generation and the ground truth. Inspired by RLAIF, we train three kinds of reward models for these metrics to achieve more efficient training. Finally, we combine the scores of these reward models as feedback and use the PPO algorithm to explore the optimal query rewriting strategy. Experimental results on two conversational RAG datasets demonstrate that MaFeRw achieves superior generation metrics and more stable training compared to baselines.
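The multi-aspect feedback described above can be sketched as a weighted combination of the three per-aspect reward-model scores into a single scalar reward for PPO. The function name, the equal default weights, and the example scores below are illustrative assumptions, not the paper's actual implementation or values:

```python
def combined_reward(sim_score: float, rank_score: float, rouge_score: float,
                    weights: tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Combine the three reward-model scores into one scalar PPO reward.

    sim_score   -- similarity between the rewritten query and the gold document
    rank_score  -- ranking metric of the retrieved documents
    rouge_score -- ROUGE between the generation and the ground truth
    weights     -- per-aspect weights (equal weighting assumed here)
    """
    w_sim, w_rank, w_rouge = weights
    return w_sim * sim_score + w_rank * rank_score + w_rouge * rouge_score


# Illustrative call with made-up scores; in training, each score would come
# from its dedicated reward model scoring a (rewritten query, output) pair.
reward = combined_reward(0.8, 0.6, 0.5)
```

A dense scalar of this form gives the PPO policy a gradient signal at every rollout, in contrast to the sparse generation-only rewards the abstract identifies as a source of unstable training.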