FineFilter: A Fine-grained Noise Filtering Mechanism for Retrieval-Augmented Large Language Models

Noise in retrieved documents hinders Retrieval-Augmented Generation (RAG) from detecting answer clues, so noise filtering mechanisms are needed to improve accuracy. Existing methods use re-ranking or summarization to identify the most relevant sentences, but directly and accurately locating answer clues in large-scale, complex documents remains challenging. Unlike these document-level operations, we treat noise filtering as a sentence-level MinMax optimization problem: first identifying potential clues from multiple documents using contextual information, then ranking them by relevance, and finally retaining the fewest clues necessary through truncation. In this paper, we propose FineFilter, a novel fine-grained noise filtering mechanism for RAG consisting of a clue extractor, a re-ranker, and a truncator. We optimize each module to tackle complex reasoning challenges: (1) the clue extractor is fine-tuned with sentences containing the answer, together with similar sentences, as targets, so that it extracts sufficient potential clues; (2) the re-ranker is trained to prioritize effective clues based on real feedback from the generation module, with clues that lead to a correct answer as positive samples and all others as negative samples; (3) the truncator is fine-tuned with the minimum set of clues needed to answer the question (the truncation point) as its target, and truncates the re-ranked clues to achieve fine-grained noise filtering. Experiments on three QA datasets demonstrate that FineFilter significantly outperforms baselines in terms of performance and inference cost. Further analysis of each module shows the effectiveness of our optimizations for complex reasoning.
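
The three-stage flow described above (extract, re-rank, truncate) can be illustrated with a minimal sketch. This is not the authors' implementation: FineFilter fine-tunes a model for each stage, whereas the code below substitutes simple word overlap for the extractor and re-ranker, and takes a fixed cutoff `k` instead of a predicted truncation point. All function names (`extract_clues`, `rerank`, `truncate`, `fine_filter`) are hypothetical and chosen only for illustration.

```python
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def extract_clues(question: str, documents: List[str]) -> List[str]:
    """Stand-in clue extractor (assumption): keep every sentence that
    shares at least one word with the question."""
    q = _tokens(question)
    clues = []
    for doc in documents:
        for sent in doc.split(". "):
            sent = sent.strip()
            if sent and _tokens(sent) & q:
                clues.append(sent)
    return clues


def rerank(question: str, clues: List[str]) -> List[str]:
    """Stand-in re-ranker (assumption): order clues by word overlap
    with the question, highest first. Ties keep their original order."""
    q = _tokens(question)
    return sorted(clues, key=lambda s: len(_tokens(s) & q), reverse=True)


def truncate(clues: List[str], k: int) -> List[str]:
    """Stand-in truncator (assumption): keep the top-k clues, where k is
    given by the caller rather than predicted by a fine-tuned model."""
    return clues[:k]


def fine_filter(question: str, documents: List[str], k: int) -> List[str]:
    """Chain the three stages at sentence level."""
    return truncate(rerank(question, extract_clues(question, documents)), k)


if __name__ == "__main__":
    docs = [
        "Paris is the capital of France. The city hosts many museums",
        "Bananas are yellow. France borders Spain",
    ]
    print(fine_filter("What is the capital of France?", docs, k=1))
```

The sketch mirrors the MinMax framing only loosely: the extractor is deliberately permissive so that recall stays high, and the truncation step then cuts the ranked list down to as few clues as the cutoff allows.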
