Although Large Language Models (LLMs) have demonstrated strong performance on a wide range of tasks, they still face reliability challenges such as hallucination. Previous studies show that highly capable LLMs like GPT-4 can effectively judge the reliability of individual responses, while less capable ones are often tuned to evaluate the relative reliability of responses to the same query. To enable less capable LLMs to effectively judge the reliability of individual responses, we propose a novel method named $\textit{Meta}$ $\textit{Ranking}$ (MR). Unlike previous methods, which assess the response directly, MR reaches its judgement by comparing the target query-response pair with reference query-response pairs. We find MR remarkably effective at detecting errors in LLM responses on reasoning tasks, where less capable LLMs can outperform strong baselines even without fine-tuning. We further demonstrate that MR can enhance the performance of LLMs in two practical applications: query routing and iterative training data filtering. The former achieves performance comparable to GPT-4-turbo with less than half the token consumption, while the latter enables instruction-tuned LLaMA-7B and Phi-2, a 2.7B model, to significantly surpass Alpaca-13B with fewer training samples, underscoring the high potential of our proposed method.
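The comparison-based judgement described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how comparisons are aggregated, so the voting rule, the `threshold` parameter, the names `meta_rank` and `toy_compare`, and the convention that the comparator returns +1, 0, or -1 are all assumptions made for illustration. In practice the comparator would be a call to a less capable LLM; here it is a stub backed by hypothetical scores.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                    # (query, response)
Reference = Tuple[Pair, bool]             # (pair, is the reference response known to be correct?)
Comparator = Callable[[Pair, Pair], int]  # +1: first pair more reliable, -1: less, 0: tie


def meta_rank(target: Pair,
              references: List[Reference],
              compare: Comparator,
              threshold: float = 0.0) -> bool:
    """Judge one target pair by comparing it against labeled reference pairs.

    Assumed voting rule (not taken from the paper):
      - matching or beating a correct reference counts as +1, losing as -1;
      - beating an incorrect reference counts as +1, tying or losing as -1.
    The target is deemed reliable if the mean vote exceeds `threshold`.
    """
    if not references:
        raise ValueError("at least one reference pair is required")
    votes = 0
    for ref_pair, ref_is_correct in references:
        verdict = compare(target, ref_pair)
        if ref_is_correct:
            votes += 1 if verdict >= 0 else -1
        else:
            votes += 1 if verdict > 0 else -1
    return votes / len(references) > threshold


if __name__ == "__main__":
    # Hypothetical reliability scores standing in for an LLM comparator.
    toy_scores = {"4": 0.9, "5": 0.1, "Paris": 0.9, "Lyon": 0.2}

    def toy_compare(a: Pair, b: Pair) -> int:
        sa, sb = toy_scores[a[1]], toy_scores[b[1]]
        return (sa > sb) - (sa < sb)

    refs = [(("Capital of France?", "Paris"), True),
            (("Capital of France?", "Lyon"), False)]
    print(meta_rank(("2 + 2 = ?", "4"), refs, toy_compare))
    print(meta_rank(("2 + 2 = ?", "5"), refs, toy_compare))
```

Passing the comparator as a parameter keeps the sketch independent of any particular model or API; a real comparator would prompt an LLM with both query-response pairs and parse its preference.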