Evaluating and ranking large language models (LLMs) has become an important problem as these models proliferate and their impact grows. Existing evaluation methods either require human responses, which are expensive to acquire, or use pairs of LLMs to evaluate each other, which can be unreliable. In this paper, we provide a novel perspective: given a dataset of prompts (such as questions or instructions) and a set of LLMs, we rank the models without access to any ground truth or reference responses. Inspired by real life, where both an expert and a knowledgeable person can identify a novice, our main idea is to consider triplets of models in which each model evaluates the other two, so that the worst model in the triplet is correctly identified with high probability. We also analyze this idea and provide sufficient conditions for it to succeed. Applying it repeatedly, we propose two methods to rank LLMs. In experiments on different generative tasks (summarization, multiple-choice, and dialog), our methods reliably recover rankings close to the true ones without reference data. This points to a viable low-resource mechanism for practical use.
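To make the triplet idea concrete, the following is a minimal sketch in Python. It is not the paper's implementation, and the aggregation used in `rank_models` (summing votes over all triplets) is an illustrative assumption rather than either of the two proposed methods. The judge interface `prefers(judge, a, b, prompt)`, the model names, and the hidden `SKILL` scores are all hypothetical. In practice, `prefers` would ask the judge model to compare the two other models' responses to the prompt.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence

# Hypothetical judge interface (an assumption, not from the paper):
# prefers(judge, a, b, prompt) returns whichever of a or b the judge
# considers to have given the better response to the prompt.
Prefers = Callable[[str, str, str, str], str]


def triplet_wins(triplet: Sequence[str], prompts: Sequence[str],
                 prefers: Prefers) -> Dict[str, int]:
    """Count votes per model when each model judges the other two."""
    wins = {m: 0 for m in triplet}
    for judge in triplet:
        # The two models in the triplet that are not the judge.
        a, b = (m for m in triplet if m != judge)
        for prompt in prompts:
            wins[prefers(judge, a, b, prompt)] += 1
    return wins


def worst_in_triplet(triplet: Sequence[str], prompts: Sequence[str],
                     prefers: Prefers) -> str:
    """Return the model that receives the fewest votes in the triplet."""
    wins = triplet_wins(triplet, prompts, prefers)
    return min(triplet, key=lambda m: wins[m])


def rank_models(models: Sequence[str], prompts: Sequence[str],
                prefers: Prefers) -> List[str]:
    """Rank models best-first by total votes over all triplets.

    Illustrative aggregation only; it assumes distinct model names.
    """
    total = {m: 0 for m in models}
    for triplet in combinations(models, 3):
        for model, count in triplet_wins(triplet, prompts, prefers).items():
            total[model] += count
    return sorted(models, key=lambda m: total[m], reverse=True)


# Toy setup: hidden skill scores stand in for true quality, and every
# judge is assumed to be noise-free, which real LLM judges are not.
SKILL = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.4, "model_d": 0.2}


def toy_prefers(judge: str, a: str, b: str, prompt: str) -> str:
    # The judge and prompt are ignored in this idealized stand-in.
    return a if SKILL[a] >= SKILL[b] else b


PROMPTS = ["prompt 1", "prompt 2", "prompt 3"]
ranking = rank_models(list(SKILL), PROMPTS, toy_prefers)
print(ranking)
```

With noise-free judges the sketch recovers the hidden ordering exactly. With real models, each vote is only probabilistically correct, which is why conditions under which the worst model is still identified with high probability matter.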