Large language models (LLMs) have revolutionized natural language processing, yet hallucinations in knowledge-intensive tasks remain a critical challenge. Retrieval-augmented generation (RAG) addresses this by integrating external knowledge, but its efficacy depends on accurate document retrieval and ranking. Although existing rerankers are effective, they often require specialized training, incur substantial computational cost, and underuse the semantic capabilities of LLMs, particularly their inherent confidence signals. We propose the LLM-Confidence Reranker (LCR), a training-free, plug-and-play algorithm that enhances reranking in RAG systems by leveraging black-box LLM confidence derived from the Maximum Semantic Cluster Proportion (MSCP). LCR employs a two-stage process: confidence assessment via multinomial sampling and clustering, followed by binning and multi-level sorting based on query and document confidence thresholds. This approach prioritizes relevant documents while preserving original rankings for high-confidence queries, ensuring robustness. Evaluated on BEIR and TREC benchmarks with BM25 and Contriever retrievers, LCR--using only 7--9B-parameter pre-trained LLMs--improves NDCG@5 by up to 20.6% across both pre-trained LLM and fine-tuned Transformer rerankers, without degradation. Ablation studies validate the hypothesis that LLM confidence positively correlates with document relevance, elucidating LCR's mechanism. LCR offers computational efficiency, parallelism for scalability, and broad compatibility, mitigating hallucinations in applications such as medical diagnosis.
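The two-stage process described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_fn` (a black-box LLM sampling call), the exact-match clustering stand-in, and the threshold values are all assumptions; the paper's MSCP would cluster by semantic equivalence rather than string equality.

```python
# Hypothetical sketch of LCR: MSCP confidence (stage 1) and
# threshold-based binning with order-preserving sorting (stage 2).
import collections

def mscp_confidence(sample_fn, prompt, n=10):
    """Maximum Semantic Cluster Proportion: sample n answers from a
    black-box LLM, cluster equivalent answers, and return the largest
    cluster's share of the samples as a confidence score in [0, 1]."""
    answers = [sample_fn(prompt) for _ in range(n)]
    # Stand-in clustering: exact match after normalization. A real MSCP
    # computation would group semantically equivalent answers instead.
    clusters = collections.Counter(a.strip().lower() for a in answers)
    return max(clusters.values()) / n

def lcr_rerank(query_conf, docs, doc_confs, q_thresh=0.8, d_thresh=0.5):
    """Stage 2: if the query itself is high-confidence, keep the
    retriever's original ranking; otherwise bin documents by their
    confidence threshold and move the high-confidence bin first.
    The stable sort preserves the original order within each bin."""
    if query_conf >= q_thresh:
        return docs
    return sorted(docs, key=lambda d: doc_confs[d] < d_thresh)
```

For example, with a low-confidence query, documents whose confidence clears `d_thresh` are promoted ahead of the rest while each bin retains the retriever's internal ordering, matching the "preserve original rankings" behavior described above.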