Evaluating the output of generative large language models (LLMs) is challenging and difficult to scale. Most LLM evaluations focus on tasks such as single-choice question answering or text classification. These tasks are unsuitable for assessing open-ended question-answering capabilities, which are critical in domains that require expertise, such as health, where a misleading or incorrect answer can cause real harm to a user. Human expert evaluation of LLM answers is generally considered the gold standard, but expert annotation is costly and slow. We present a method for evaluating LLM answers that uses ranking signals as a substitute for explicit relevance judgements. Our scoring method correlates with the preferences of human experts. We validate it by investigating the well-established finding that the quality of generated answers improves with the size of the model as well as with more sophisticated prompting strategies.
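As a minimal sketch of the validation step described above, the agreement between an automatic scoring method and human expert preferences can be quantified with a rank correlation such as Kendall's tau. The scores below are hypothetical placeholders, not data from the paper, and the snippet does not reproduce the paper's actual ranking-signal method:

```python
# A minimal sketch (assumed, not the paper's implementation): checking rank
# agreement between an automatic scoring method and human expert preferences.
from scipy.stats import kendalltau

# Hypothetical quality scores for five generated answers.
automatic_scores = [0.91, 0.78, 0.65, 0.52, 0.40]  # from a ranking-signal scorer
expert_scores = [5, 4, 4, 2, 1]                    # human expert preference ratings

# Kendall's tau measures how consistently the two orderings agree;
# values near 1.0 indicate the automatic scores rank answers like the experts do.
tau, p_value = kendalltau(automatic_scores, expert_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```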