Large Language Models (LLMs) can exhibit considerable variation in the quality of their sampled outputs. Reranking the sampled set and selecting the best generation is a popular way of obtaining strong gains in generation quality. In this paper, we present a novel approach for reranking LLM generations. Unlike other techniques that might involve additional inferences or training a specialized reranker, our approach relies on easy-to-compute pairwise statistics between the generations, which incur minimal compute overhead. We show that our approach can be formalized as an extension of self-consistency and analyze its performance in that framework, both theoretically and via simulations. We show strong improvements in selecting the best k generations for code generation tasks, as well as robust improvements in selecting the best generation for autoformalization, summarization, and translation. While our approach assumes only black-box access to LLMs, we show that additional access to token probabilities can improve performance even further.
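To make the idea of reranking by pairwise statistics concrete, the following is a minimal sketch and not the method evaluated in the paper. The abstract does not specify which statistic is used, so the sketch assumes Jaccard overlap of whitespace-token n-grams as the pairwise statistic and scores each generation by its mean similarity to the other samples, in the spirit of self-consistency. The names `ngram_set`, `jaccard`, and `rerank_by_consensus` are hypothetical and chosen for illustration only.

```python
from typing import List, Set, Tuple


def ngram_set(text: str, n: int = 1) -> Set[Tuple[str, ...]]:
    """Set of whitespace-token n-grams in a generation (assumed feature choice)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard similarity of two feature sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rerank_by_consensus(generations: List[str], n: int = 1) -> List[int]:
    """Return indices of generations, most consistent with the rest first.

    Each generation is scored by its mean pairwise similarity to every
    other sample; no extra model inference or trained reranker is needed.
    """
    feats = [ngram_set(g, n) for g in generations]
    scores = []
    for i, fi in enumerate(feats):
        others = [jaccard(fi, fj) for j, fj in enumerate(feats) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    # Higher mean similarity ranks first; ties keep their original order.
    return sorted(range(len(generations)), key=lambda i: scores[i], reverse=True)


samples = ["return a + b", "return a + b", "return a - b", "print ( hello )"]
order = rerank_by_consensus(samples)
top_k = [samples[i] for i in order[:2]]
```

Under these assumptions, the two agreeing samples rank above the near-miss, and the unrelated sample ranks last. A setting with access to token probabilities could weight the features accordingly, but that variant is not sketched here.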